Abstract

We consider the problem of online linear regression on arbitrary deterministic sequences
when the ambient dimension d can be much larger than the number of time rounds T. We 
introduce the notion of sparsity regret bound, which is a deterministic online counterpart of 
recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such 
regret bounds for an online-learning algorithm called SeqSEW and based on exponential 
weighting and data-driven truncation. In a second part we apply a parameter-free version of 
this algorithm to the stochastic setting (regression model with random design). This yields 
risk bounds of the same flavor as in Dalalyan and Tsybakov (2011) but which solve two 
questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic 
factor) to the unknown variance of the noise if the latter is Gaussian. We also address the
regression model with fixed design.

Keywords: sparsity, online linear regression, individual sequences, adaptive regret bounds

1. Introduction

Sparsity has been extensively studied in the stochastic setting over the past decade. This 
notion is key to address statistical problems that are high-dimensional, i.e., where the 
number of unknown parameters is of the same order or even much larger than the number 
of observations. This is the case in many contemporary applications such as computational 
biology (e.g., analysis of DNA sequences), collaborative filtering (e.g., Netflix, Amazon), 
satellite and hyperspectral imaging, and high-dimensional econometrics (e.g., cross-country 
growth regression problems). 

A key message about sparsity is that, although high-dimensional statistical inference is 
impossible in general (i.e., without further assumptions), it becomes statistically feasible if 
among the many unknown parameters, only few of them are non-zero. Such a situation is 
called a sparsity scenario and has been the focus of many theoretical, computational, and 
practical works over the past decade in the stochastic setting. On the theoretical side, most 
sparsity-related risk bounds take the form of the so-called sparsity oracle inequalities, i.e., 
risk bounds expressed in terms of the number of non-zero coordinates of the oracle vector. 
As of now, such theoretical guarantees have only been proved under stochastic assumptions.

In this paper we address the prediction possibilities under a sparsity scenario in both
deterministic and stochastic settings. We first prove that theoretical guarantees similar to
sparsity oracle inequalities can be obtained in a deterministic online setting, namely, online 
linear regression on individual sequences. The newly obtained deterministic prediction 
guarantees are called sparsity regret bounds. We prove such bounds for an online-learning 
algorithm which, in its most sophisticated version, is fully automatic in the sense that no 
preliminary knowledge is needed for the choice of its tuning parameters. In the second 
part of this paper, we apply our sparsity regret bounds — of deterministic nature — to the 
stochastic setting (regression model with random design). One of our key results is that, 
thanks to our online tuning techniques, these deterministic bounds imply sparsity oracle 
inequalities that are adaptive to the unknown variance of the noise (up to logarithmic 
factors) when the latter is Gaussian. In particular, this solves an open question raised by 
Dalalyan and Tsybakov (2011). 

In the next paragraphs, we introduce our main setting and motivate the notion of spar- 
sity regret bound from an online-learning viewpoint. We then detail our main contributions 
with respect to the statistical literature and the machine-learning literature. 

Introduction of a deterministic counterpart of sparsity oracle inequalities 

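We consider the problem of online linear regression on arbitrary deterministic sequences. A
forecaster has to predict in a sequential fashion the values $y_t \in \mathbb{R}$ of an unknown sequence
of observations given some input data $x_t \in \mathcal{X}$ and some base forecasters $\varphi_j : \mathcal{X} \to \mathbb{R}$,
$1 \le j \le d$, on the basis of which he outputs a prediction $\hat{y}_t \in \mathbb{R}$. The quality of the
predictions is assessed by the square loss. The goal of the forecaster is to predict almost as
well as the best linear forecaster $\mathbf{u} \cdot \boldsymbol{\varphi} \triangleq \sum_{j=1}^d u_j \varphi_j$, where $\mathbf{u} \in \mathbb{R}^d$, i.e., to satisfy, uniformly
over all individual sequences $(x_t, y_t)_{1 \le t \le T}$, a regret bound of the form

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\mathbf{u} \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 + \Delta_{T,d}(\mathbf{u}) \right\}$$

for some regret term $\Delta_{T,d}(\mathbf{u})$ that should be as small as possible and, in particular, sublinear
in $T$. (For the sake of introduction, we omit the dependencies of $\Delta_{T,d}(\mathbf{u})$ on the amplitudes
$\max_{1 \le t \le T} \|x_t\|_\infty$ and $\max_{1 \le t \le T} |y_t|$.)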

In this setting the version of the sequential ridge regression forecaster studied by Azoury
and Warmuth (2001) and Vovk (2001) can be tuned to have a regret $\Delta_{T,d}(\mathbf{u})$ of order at
most $d \ln\bigl(T \|\mathbf{u}\|_2^2\bigr)$. When the ambient dimension $d$ is much larger than the number of time
rounds $T$, the latter regret bound may unfortunately be larger than $T$ and is thus essentially
trivial. Since the regret bound $d \ln T$ is optimal in a certain sense (see, e.g., the lower bound
of Vovk 2001, Theorem 2), additional assumptions are needed to get interesting theoretical
guarantees.

A natural assumption, which has already been extensively studied in the stochastic
setting, is that there is a sparse linear combination $\mathbf{u}^*$ (i.e., with $s \ll T/(\ln T)$ non-zero
coefficients) which has a small cumulative square loss. If the forecaster knew in advance the
support $J(\mathbf{u}^*) \triangleq \{j : u_j^* \neq 0\}$ of $\mathbf{u}^*$, he could apply the same forecaster as above but only
to the $s$-dimensional linear subspace $\{\mathbf{u} \in \mathbb{R}^d : \forall j \notin J(\mathbf{u}^*),\ u_j = 0\}$. The regret bound of
this "oracle" would be roughly of order $s \ln T$ and thus sublinear in $T$. Under this sparsity
scenario, a sublinear regret thus seems possible, though, of course, the aforementioned regret
bound $s \ln T$ can only be used as an ideal benchmark (since the support of $\mathbf{u}^*$ is unknown).

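In this paper, we prove that a regret bound proportional to $s$ is achievable (up to
logarithmic factors). In Corollary 2 and its refinements (Corollary 7 and Theorem 10), we
indeed derive regret bounds of the form

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\mathbf{u} \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 + \bigl(\|\mathbf{u}\|_0 + 1\bigr)\, g_{T,d}\bigl(\|\mathbf{u}\|_1, \|\boldsymbol{\varphi}\|_\infty\bigr) \right\} , \tag{1}$$

where $\|\mathbf{u}\|_0$ denotes the number of non-zero coordinates of $\mathbf{u}$ and where $g$ grows at most
logarithmically in $T$, $d$, $\|\mathbf{u}\|_1 \triangleq \sum_{j=1}^d |u_j|$, and $\|\boldsymbol{\varphi}\|_\infty \triangleq \sup_{x \in \mathcal{X}} \max_{1 \le j \le d} |\varphi_j(x)|$. We call
regret bounds of the above form sparsity regret bounds.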

This work is in connection with several papers that belong either to the statistical or to 
the machine-learning literature. Next we discuss these papers and some related references. 

Related works in the stochastic setting 

The above regret bound (1) can be seen as a deterministic online counterpart of the so-called
sparsity oracle inequalities introduced in the stochastic setting in the past decade. The latter
are risk bounds expressed in terms of the number of non-zero coordinates of the oracle
vector; see (2) below. More formally, consider the regression model with random or fixed
design. The forecaster observes independent random pairs $(X_1, Y_1), \ldots, (X_T, Y_T) \in \mathcal{X} \times \mathbb{R}$
given by

$$Y_t = f(X_t) + \varepsilon_t , \qquad 1 \le t \le T ,$$

where the $X_t \in \mathcal{X}$ are either i.i.d. random variables (random design) or fixed elements (fixed
design), denoted in both cases by capital letters in this paragraph, and where the $\varepsilon_t$ are
i.i.d. square-integrable real random variables with zero mean (conditionally on the $X_t$ if the
design is random). The goal of the forecaster is to construct an estimator $\hat{f}_T : \mathcal{X} \to \mathbb{R}$ of
the unknown regression function $f : \mathcal{X} \to \mathbb{R}$ based on the sample $(X_t, Y_t)_{1 \le t \le T}$. Depending
on the nature of the design, the performance of $\hat{f}_T$ is measured through its risk $R(\hat{f}_T)$:

$$R(\hat{f}_T) \triangleq \begin{cases} \displaystyle\int_{\mathcal{X}} \bigl(f(x) - \hat{f}_T(x)\bigr)^2 \, P^X(dx) & \text{(random design)} \\[2ex] \displaystyle\frac{1}{T} \sum_{t=1}^T \bigl(f(X_t) - \hat{f}_T(X_t)\bigr)^2 & \text{(fixed design),} \end{cases}$$

where $P^X$ denotes the common distribution of the $X_t$ if the design is random. With the
above notations, and given a dictionary $\boldsymbol{\varphi} = (\varphi_1, \ldots, \varphi_d)$ of base regressors $\varphi_j : \mathcal{X} \to \mathbb{R}$ as
previously, typical examples of sparsity oracle inequalities take approximately the form

$$R(\hat{f}_T) \le C \inf_{\mathbf{u} \in \mathbb{R}^d} \left\{ R(\mathbf{u} \cdot \boldsymbol{\varphi}) + \frac{\|\mathbf{u}\|_0 \ln d + 1}{T} \right\} \tag{2}$$

in expectation or with high probability, for some constant $C \ge 1$. Thus, sparsity oracle
inequalities are risk bounds involving a trade-off between the risk $R(\mathbf{u} \cdot \boldsymbol{\varphi})$ and the number
of non-zero coordinates $\|\mathbf{u}\|_0$ of any linear combination $\mathbf{u} \in \mathbb{R}^d$. In particular, they indicate
that $\hat{f}_T$ has a small risk under a sparsity scenario, i.e., if $f$ is well approximated by a sparse
linear combination $\mathbf{u}^*$ of the base regressors $\varphi_j$, $1 \le j \le d$.

Sparsity oracle inequalities were first derived by Birgé and Massart (2001) via $\ell^0$-regularization
methods (through model-selection arguments). Later works in this direction
include, among many other papers, those of Birgé and Massart (2007); Abramovich et al.
(2006); Bunea et al. (2007a) in the regression model with fixed design and that of Bunea
et al. (2004) in the random design case.

More recently, a large body of research has been dedicated to the analysis of $\ell^1$-regularization
methods, which are convex and thus computationally tractable variants of
$\ell^0$-regularization methods. A celebrated example is the Lasso estimator introduced by Tibshirani
(1996) and Donoho and Johnstone (1994). Under some assumptions on the design
matrix$^1$, such methods have been proved to satisfy sparsity oracle inequalities of the form (2)
(with $C = 1$ in the recent paper by Koltchinskii et al. 2011). A list of a few references, far
from being comprehensive, includes the works of Bunea et al. (2007b); Candès and Tao
(2007); van de Geer (2008); Bickel et al. (2009); Koltchinskii (2009a,b); Hebiri and van de
Geer (2011); Koltchinskii et al. (2011); Lounici et al. (2011). We refer the reader to the
monograph by Bühlmann and van de Geer (2011) for a detailed account on $\ell^1$-regularization.

A third line of research recently focused on procedures based on exponential weighting. 
Such methods were proved to satisfy sharp sparsity oracle inequalities (i.e., with leading con- 
stant C = 1), either in the regression model with fixed design (Dalalyan and Tsybakov, 2007, 
2008; Rigollet and Tsybakov, 2011; Alquier and Lounici, 2011) or in the regression model 
with random design (Dalalyan and Tsybakov, 2011; Alquier and Lounici, 2011). These 
papers show that a trade-off can be reached between strong theoretical guarantees (as with
$\ell^0$-regularization) and computational efficiency (as with $\ell^1$-regularization). They indeed
propose aggregation algorithms which satisfy sparsity oracle inequalities under almost no
assumption on the base forecasters $(\varphi_j)_j$, and which can be approximated numerically at a
reasonable computational cost for large values of the ambient dimension $d$.

Our online-learning algorithm SeqSEW is inspired by a statistical method of Dalalyan
and Tsybakov (2008, 2011). Following the same lines as in Dalalyan and Tsybakov (2009),
it is possible to slightly adapt the statement of our algorithm to make it computationally 
tractable by means of Langevin Monte-Carlo approximation — without affecting its statisti- 
cal properties. The technical details are however omitted in this paper, which only focuses 
on the theoretical guarantees of the algorithm SeqSEW. 

Previous works on sparsity in the framework of individual sequences 

To the best of our knowledge, Corollary 2 and its refinements (Corollary 7 and Theorem 10)
provide the first examples of sparsity regret bounds in the sense of (1). To comment on the 
optimality of such regret bounds and compare them to related results in the framework of 
individual sequences, note that (1) can be rewritten in the equivalent form: 



1. Despite their computational efficiency, the aforementioned $\ell^1$-regularized methods still suffer from a
drawback: their $\ell^0$-oracle properties hold under rather restrictive assumptions on the design; namely,
that the $\varphi_j$ should be nearly orthogonal (see the detailed discussion in van de Geer and Bühlmann 2009).






Sparsity Regret Bounds for Individual Sequences 



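For all $s \in \mathbb{N}$ and all $U > 0$,

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 - \inf_{\substack{\|\mathbf{u}\|_0 \le s \\ \|\mathbf{u}\|_1 \le U}} \sum_{t=1}^T \bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 \le (s + 1)\, g_{T,d}\bigl(U, \|\boldsymbol{\varphi}\|_\infty\bigr) ,$$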

where $g$ grows at most logarithmically in $T$, $d$, $U$, and $\|\boldsymbol{\varphi}\|_\infty$. When $s \ll T$, this upper
bound matches (up to logarithmic factors) the lower bound of order $s \ln T$ that follows in
a straightforward manner from Theorem 2 of Vovk (2001). Indeed, if $s \ll T$, $\mathcal{X} = \mathbb{R}^d$, and
$\varphi_j(x) = x_j$, then for any forecaster, there is an individual sequence $(x_t, y_t)_{1 \le t \le T}$ such that
the regret of this forecaster on $\{\mathbf{u} \in \mathbb{R}^d : \|\mathbf{u}\|_0 \le s \text{ and } \|\mathbf{u}\|_1 \le d\}$ is bounded from below
by a quantity of order $s \ln T$. Therefore, up to logarithmic factors, any algorithm satisfying
a sparsity regret bound of the form (1) is minimax optimal on intersections of $\ell^0$-balls (of
radii $s \ll T$) and $\ell^1$-balls. This is in particular the case for our algorithm SeqSEW, but
this contrasts with related works discussed below.

Recent works in the field of online convex optimization addressed the sparsity issue in
the online deterministic setting, but from a quite different angle. They focus on algorithms
which output sparse linear combinations, while we are interested in algorithms whose regret
is small under a sparsity scenario, i.e., on $\ell^0$-balls of small radii. See, e.g., the papers
by Langford et al. (2009); Shalev-Shwartz and Tewari (2009); Xiao (2010); Duchi et al.
(2010) and the references therein. All these articles focus on convex regularization. In the
particular case of $\ell^1$-regularization under the square loss, the aforementioned works propose
algorithms which predict as a sparse linear combination $\hat{y}_t = \mathbf{u}_t \cdot \boldsymbol{\varphi}(x_t)$ of the base forecasts
(i.e., $\|\mathbf{u}_t\|_0$ is small), while no such guarantee can be proved for our algorithm SeqSEW.
However, they prove bounds on the $\ell^1$-regularized regret of the form

$$\sum_{t=1}^T \Bigl( \bigl(y_t - \mathbf{u}_t \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 + \lambda \|\mathbf{u}_t\|_1 \Bigr) \le \inf_{\mathbf{u} \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \Bigl( \bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 + \lambda \|\mathbf{u}\|_1 \Bigr) + \tilde{\Delta}_{T,d}(\mathbf{u}) \right\} , \tag{3}$$

for some regret term $\tilde{\Delta}_{T,d}(\mathbf{u})$ which is suboptimal on intersections of $\ell^0$- and $\ell^1$-balls as
explained below. The truncated gradient algorithm of Langford et al. (2009, Corollary 4.1)
satisfies$^2$ such a regret bound with $\tilde{\Delta}_{T,d}(\mathbf{u})$ at least of order $\|\boldsymbol{\varphi}\|_\infty \sqrt{dT}$ when the base
forecasts $\varphi_j(x_t)$ are dense in the sense that $\max_{1 \le t \le T} \sum_{j=1}^d \varphi_j^2(x_t) \approx d \|\boldsymbol{\varphi}\|_\infty^2$. This regret
bound grows as a power of $d$ and not logarithmically in $d$ as is expected for sparsity regret
bounds (recall that we are interested in the case when $d \gg T$).

The three other papers mentioned above do prove (some) regret bounds with a logarithmic
dependence in $d$, but these bounds do not have the dependence in $\|\mathbf{u}\|_1$ and $T$ we are
looking for. For $p - 1 \approx 1/(\ln d)$, the $p$-norm RDA method of Xiao (2010) and the algorithm
SMIDAS of Shalev-Shwartz and Tewari (2009) (the latter being a particular case of the
algorithm COMID of Duchi et al. (2010) specialized to the $p$-norm divergence) satisfy regret
bounds of the above form (3) with $\tilde{\Delta}_{T,d}(\mathbf{u}) \approx \mu \|\mathbf{u}\|_1 \sqrt{T \ln d}$, for some gradient-based



2. The bound stated in Langford et al. (2009, Corollary 4.1) differs from (3) in that the constant before
the infimum is equal to $C = 1/(1 - 2 c_d^2 \eta)$, where $c_d^2 \approx \max_{1 \le t \le T} \sum_{j=1}^d \varphi_j^2(x_t) \le d \|\boldsymbol{\varphi}\|_\infty^2$, and where a
reasonable choice for $\eta$ can easily be seen to be $\eta \approx 1/\sqrt{2 c_d^2 T}$. If the base forecasts $\varphi_j(x_t)$ are dense in
the sense that $c_d^2 \approx d \|\boldsymbol{\varphi}\|_\infty^2$, then we have $C \approx 1 + \sqrt{2 c_d^2 / T}$, which yields a regret bound with leading
constant 1 as in (3) and with $\tilde{\Delta}_{T,d}(\mathbf{u})$ at least of order $\sqrt{c_d^2 T} \approx \|\boldsymbol{\varphi}\|_\infty \sqrt{dT}$.

constant $\mu$. Therefore, in all three cases, the function $\tilde{\Delta}$ grows at least linearly in $\|\mathbf{u}\|_1$
and as $\sqrt{T}$. This is in contrast with the logarithmic dependence in $\|\mathbf{u}\|_1$ and the fast rate
$O(\ln T)$ we are looking for and prove, e.g., in Corollary 2.

Note that the suboptimality of the aforementioned algorithms is specific to the goal we
are pursuing, i.e., prediction on $\ell^0$-balls (intersected with $\ell^1$-balls). On the contrary, the
rate $\|\mathbf{u}\|_1 \sqrt{T \ln d}$ is more suited and actually nearly optimal for learning on $\ell^1$-balls (see
Gerchinovitz and Yu 2011). Moreover, the predictions output by our algorithm SeqSEW are
not necessarily sparse linear combinations of the base forecasts. A question left open is thus
whether it is possible to design an algorithm which both outputs sparse linear combinations
(which is statistically useful and sometimes essential for computational issues) and satisfies
a sparsity regret bound of the form (1).

PAC-Bayesian analysis in the framework of individual sequences 

To derive our sparsity regret bounds, we follow a PAC-Bayesian approach combined with the 
choice of a sparsity-favoring prior. We do not have the space to review the PAC-Bayesian 
literature in the stochastic setting and only refer the reader to Catoni (2004) for a thorough 
introduction to the subject. As for the online deterministic setting, PAC-Bayesian-type 
inequalities were proved in the framework of prediction with expert advice, e.g., by Freund 
et al. (1997) and Kivinen and Warmuth (1999), or in the same setting as ours with a 
Gaussian prior by Vovk (2001). More recently, Audibert (2009) proved a PAC-Bayesian 
result on individual sequences for general losses and prediction sets. The latter result relies 
on a unifying assumption called the online variance inequality, which holds true, e.g., when 
the loss function is exp-concave. In the present paper, we only focus on the particular case 
of the square loss. We first use Theorem 4.6 of Audibert (2009) to derive a non-adaptive 
sparsity regret bound. We then provide an adaptive online PAC-Bayesian inequality to 
automatically adapt to the unknown range of the observations $\max_{1 \le t \le T} |y_t|$.

Application to the stochastic setting when the noise level is unknown 

In Section 4.1 we apply an automatically-tuned version of our algorithm SeqSEW on i.i.d. 
data. Thanks to the standard online-to-batch conversion, our sparsity regret bounds — of 
deterministic nature — imply a sparsity oracle inequality of the same flavor as a result of 
Dalalyan and Tsybakov (2011). However, our risk bound holds on the whole space $\mathbb{R}^d$
instead of $\ell^1$-balls of finite radii, which solves one question left open by Dalalyan and
Tsybakov (2011, Section 4.2). Besides, and more importantly, our algorithm does not need 
the a priori knowledge of the variance of the noise when the latter is Gaussian. Since the 
noise level is unknown in practice, adapting to it is important. This solves a second question 
raised by Dalalyan and Tsybakov (2011, Section 5.1, Remark 6). 

Outline of the paper 

This paper is organized as follows. In Section 2 we describe our main (deterministic) setting 
as well as our main notations. In Section 3 we prove the aforementioned sparsity regret 
bounds for our algorithm SeqSEW, first when the forecaster has access to some a priori 
knowledge on the observations (Sections 3.1 and 3.2), and then when no a priori information
is available (Section 3.3), which yields a fully automatic algorithm. In Section 4 we apply
the algorithm SeqSEW to two stochastic settings: the regression model with random design 
(Section 4.1) and the regression model with fixed design (Section 4.2). Finally the appendix 
contains some proofs and several useful inequalities. 

2. Setting and notations 

The main setting considered in this paper is an instance of the game of prediction with expert 
advice called prediction with side information (under the square loss) or, more simply, online 
linear regression (see Cesa-Bianchi and Lugosi 2006, Chapter 11 for an introduction to this 
setting). The data sequence $(x_t, y_t)_{t \ge 1}$ at hand is deterministic and arbitrary and we look
for theoretical guarantees that hold for every individual sequence. We give in Figure 1 a 
detailed description of our online protocol. 



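Parameters: input data set $\mathcal{X}$, base forecasters $\boldsymbol{\varphi} = (\varphi_1, \ldots, \varphi_d)$ with $\varphi_j : \mathcal{X} \to \mathbb{R}$,
$1 \le j \le d$.

Initial step: the environment chooses a sequence of observations $(y_t)_{t \ge 1}$ in $\mathbb{R}$ and a
sequence of input data $(x_t)_{t \ge 1}$ in $\mathcal{X}$ but the forecaster does not have access to them.

At each time round $t \in \mathbb{N}^* \triangleq \{1, 2, \ldots\}$,

1. The environment reveals the input data $x_t \in \mathcal{X}$.

2. The forecaster chooses a prediction $\hat{y}_t \in \mathbb{R}$
(possibly as a linear combination of the $\varphi_j(x_t)$, but this is not necessary).

3. The environment reveals the observation $y_t \in \mathbb{R}$.

4. Each linear forecaster $\mathbf{u} \cdot \boldsymbol{\varphi} \triangleq \sum_{j=1}^d u_j \varphi_j$, $\mathbf{u} \in \mathbb{R}^d$, incurs the loss $\bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2$
and the forecaster incurs the loss $(y_t - \hat{y}_t)^2$.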



Figure 1: The online linear regression setting. 
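The protocol of Figure 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the names `run_protocol` and `predict_zero`, the toy data, and the representation of each input $x_t$ by its feature vector $(\varphi_1(x_t), \ldots, \varphi_d(x_t))$ are assumptions made for the example and do not come from the paper.

```python
from typing import Callable, List, Sequence, Tuple

Round = Tuple[List[float], float]  # (features of x_t, observation y_t)


def run_protocol(
    data: Sequence[Round],
    predict: Callable[[List[float], List[Round]], float],
    u: Sequence[float],
) -> Tuple[float, float]:
    """Play the protocol; return (loss of the forecaster, loss of the comparator u)."""
    history: List[Round] = []
    loss_forecaster = 0.0
    loss_comparator = 0.0
    for features, y in data:
        # Steps 1-2: the forecaster sees x_t (here its features) and past data only.
        y_hat = predict(features, history)
        # Fixed linear forecaster u . phi(x_t), used as a benchmark.
        y_lin = sum(uj * fj for uj, fj in zip(u, features))
        # Steps 3-4: y_t is revealed and square losses are incurred.
        loss_forecaster += (y - y_hat) ** 2
        loss_comparator += (y - y_lin) ** 2
        history.append((features, y))
    return loss_forecaster, loss_comparator


def predict_zero(features: List[float], history: List[Round]) -> float:
    """A trivial (hypothetical) forecaster that always predicts 0."""
    return 0.0


data = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0), ([1.0, 1.0], 1.0)]
loss_f, loss_u = run_protocol(data, predict_zero, u=[1.0, 0.0])
regret = loss_f - loss_u  # regret of the forecaster against this particular u
```

The regret studied in the paper compares the forecaster with the best such $\mathbf{u}$ in hindsight; the sketch only evaluates a single fixed comparator.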

Note that our online protocol is described as if the environment were oblivious to the 
forecaster's predictions. Actually, since we only consider deterministic forecasters, all regret
bounds of this paper also hold when $(x_t)_{t \ge 1}$ and $(y_t)_{t \ge 1}$ are chosen by an adversarial
environment.

Two stochastic batch settings are also considered later in this paper. See Section 4.1 
for the regression model with random design, and Section 4.2 for the regression model with 
fixed design. 

Some notations 

We now define some notations. We write $\mathbb{N} \triangleq \{0, 1, \ldots\}$ and $e \triangleq \exp(1)$. Vectors in $\mathbb{R}^d$ will
be denoted by bold letters. For all $\mathbf{u}, \mathbf{v} \in \mathbb{R}^d$, the standard inner product in $\mathbb{R}^d$ between
$\mathbf{u} = (u_1, \ldots, u_d)$ and $\mathbf{v} = (v_1, \ldots, v_d)$ will be denoted by $\mathbf{u} \cdot \mathbf{v} = \sum_{j=1}^d u_j v_j$; the $\ell^0$-
and $\ell^1$-norms of $\mathbf{u} = (u_1, \ldots, u_d)$ are respectively defined by

$$\|\mathbf{u}\|_0 \triangleq \sum_{j=1}^d \mathbb{I}_{\{u_j \neq 0\}} = \bigl|\{j : u_j \neq 0\}\bigr| , \qquad \|\mathbf{u}\|_1 \triangleq \sum_{j=1}^d |u_j| .$$

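The set of all probability distributions on a set $\Theta$ (endowed with some $\sigma$-algebra, e.g., the
Borel $\sigma$-algebra when $\Theta = \mathbb{R}^d$) will be denoted by $\mathcal{M}_1^+(\Theta)$. For all $\rho, \pi \in \mathcal{M}_1^+(\Theta)$, the
Kullback-Leibler divergence between $\rho$ and $\pi$ is defined by

$$\mathcal{K}(\rho, \pi) \triangleq \begin{cases} \displaystyle\int_\Theta \ln\!\left(\frac{d\rho}{d\pi}\right) d\rho & \text{if } \rho \text{ is absolutely continuous with respect to } \pi ; \\[2ex] +\infty & \text{otherwise,} \end{cases}$$

where $\frac{d\rho}{d\pi}$ denotes the Radon-Nikodym derivative of $\rho$ with respect to $\pi$.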

For all $x \in \mathbb{R}$ and $B > 0$, we denote by $\lceil x \rceil$ the smallest integer larger than or equal
to $x$, and by $[x]_B$ its thresholded (or clipped) value:

$$[x]_B \triangleq \begin{cases} -B & \text{if } x < -B ; \\ x & \text{if } -B \le x \le B ; \\ B & \text{if } x > B . \end{cases}$$

Finally, we will use the (natural) conventions $1/0 = +\infty$, $(+\infty) \times 0 = 0$, and $0 \ln(1 + U/0) = 0$
for all $U \ge 0$. Any sum $\sum_{s=1}^{0} a_s$ indexed from 1 up to 0 is by convention equal to 0.
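As a small illustration, the clipping operator $[x]_B$ can be written as follows in Python (the function name `clip` is ours, not the paper's). The loop at the end checks, on a few values, the elementary fact that clipping to $[-B, B]$ never moves a prediction further away from a point $y \in [-B, B]$.

```python
def clip(x: float, B: float) -> float:
    """Return [x]_B, i.e., x truncated to the interval [-B, B] (with B >= 0)."""
    return max(-B, min(B, x))


# For y in [-B, B], clipping x cannot increase the distance to y.
B = 1.0
for y in (-1.0, 0.3, 1.0):
    for x in (-5.0, -0.2, 0.9, 7.5):
        assert abs(y - clip(x, B)) <= abs(y - x)
```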



3. Sparsity regret bounds for individual sequences 

In this section we prove sparsity regret bounds for different variants of our algorithm
SeqSEW. We first assume in Section 3.1 that the forecaster has access in advance to a
bound $B_y$ on the observations $|y_t|$ and a bound $B_\Phi$ on the trace of the empirical Gram
matrix. We then remove these requirements one by one in Sections 3.2 and 3.3.

3.1 Known bounds $B_y$ on the observations and $B_\Phi$ on the trace of the
empirical Gram matrix

To simplify the analysis, we first assume that, at the beginning of the game, the number
of rounds $T$ is known to the forecaster and that he has access to a bound $B_y$ on all the
observations $y_1, \ldots, y_T$ and to a bound $B_\Phi$ on the trace of the empirical Gram matrix, i.e.,

$$y_1, \ldots, y_T \in [-B_y, B_y] \qquad \text{and} \qquad \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le B_\Phi .$$

The first version of the algorithm studied in this paper is defined in Figure 2 (adaptive
variants will be introduced later). We name it SeqSEW for it is a variant of the Sparse
Exponential Weighting algorithm introduced in the stochastic setting by Dalalyan and Tsybakov
(2007, 2008) which is tailored for the prediction of individual sequences.

The choice of the heavy-tailed prior $\pi_\tau$ is due to Dalalyan and Tsybakov (2007). The
role of heavy-tailed priors to tackle the sparsity issue was already pointed out earlier; see,
e.g., the discussion by Seeger (2008, Section 2.1). In high dimension, such heavy-tailed
priors favor sparsity: sampling from these prior distributions (or posterior distributions
based on them) typically results in approximately sparse vectors, i.e., vectors having most
coordinates almost equal to zero and the few remaining ones with quite large values.



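Parameters: threshold $B > 0$, inverse temperature $\eta > 0$, and prior scale $\tau > 0$ with
which we associate the sparsity prior $\pi_\tau \in \mathcal{M}_1^+(\mathbb{R}^d)$ defined by

$$\pi_\tau(d\mathbf{u}) \triangleq \prod_{j=1}^d \frac{(3/\tau)\, du_j}{2 \bigl(1 + |u_j|/\tau\bigr)^4} .$$

Initialization: $p_1 \triangleq \pi_\tau$.

At each time round $t \ge 1$,

1. Get the input data $x_t$ and predict$^a$ as $\hat{y}_t \triangleq \int_{\mathbb{R}^d} \bigl[\mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr]_B \, p_t(d\mathbf{u})$;

2. Get the observation $y_t$ and compute the posterior distribution $p_{t+1} \in \mathcal{M}_1^+(\mathbb{R}^d)$ as

$$p_{t+1}(d\mathbf{u}) \triangleq \frac{\exp\Bigl(-\eta \sum_{s=1}^t \bigl(y_s - [\mathbf{u} \cdot \boldsymbol{\varphi}(x_s)]_B\bigr)^2\Bigr)}{W_{t+1}} \, \pi_\tau(d\mathbf{u}) ,$$

where

$$W_{t+1} \triangleq \int_{\mathbb{R}^d} \exp\Bigl(-\eta \sum_{s=1}^t \bigl(y_s - [\mathbf{v} \cdot \boldsymbol{\varphi}(x_s)]_B\bigr)^2\Bigr) \, \pi_\tau(d\mathbf{v}) .$$

a. The clipping operator $[\cdot]_B$ is defined in Section 2.

Figure 2: The algorithm SeqSEW$^{B,\eta}_\tau$.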
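To make the updates of Figure 2 concrete, here is a rough numerical sketch in Python. It is not the approximation scheme referred to in the paper (Langevin Monte-Carlo): it is a naive importance-sampling approximation that draws a fixed set of vectors from the prior $\pi_\tau$ and reweights them exponentially. The class name `SeqSEWMonteCarlo`, the number of particles, and the toy data are assumptions of this sketch. Sampling uses the fact that, under the prior density above, $|u_j|/\tau$ has survival function $(1+v)^{-3}$.

```python
import math
import random
from typing import List, Sequence


def clip(x: float, B: float) -> float:
    """Clipping operator [x]_B."""
    return max(-B, min(B, x))


def sample_prior(d: int, tau: float, rng: random.Random) -> List[float]:
    """Draw u with i.i.d. coordinates of density (3/tau) / (2 (1 + |u_j|/tau)^4)."""
    u = []
    for _ in range(d):
        # Inverse-CDF sampling of v = |u_j| / tau, whose survival function is (1 + v)^(-3).
        v = (1.0 - rng.random()) ** (-1.0 / 3.0) - 1.0
        u.append(tau * v if rng.random() < 0.5 else -tau * v)
    return u


class SeqSEWMonteCarlo:
    """Importance-sampling approximation of the SeqSEW updates (illustrative only)."""

    def __init__(self, d: int, B: float, eta: float, tau: float,
                 n_particles: int = 2000, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.B = B
        self.eta = eta
        self.particles = [sample_prior(d, tau, rng) for _ in range(n_particles)]
        # Cumulative clipped square loss of each particle: sum_s (y_s - [u . phi(x_s)]_B)^2.
        self.cum_loss = [0.0] * n_particles

    def _clipped(self, u: Sequence[float], features: Sequence[float]) -> float:
        return clip(sum(a * b for a, b in zip(u, features)), self.B)

    def predict(self, features: Sequence[float]) -> float:
        """Approximate the posterior mean of [u . phi(x_t)]_B."""
        shift = min(self.cum_loss)  # subtracting the minimum avoids underflow
        weights = [math.exp(-self.eta * (c - shift)) for c in self.cum_loss]
        preds = [self._clipped(u, features) for u in self.particles]
        return sum(w * p for w, p in zip(weights, preds)) / sum(weights)

    def update(self, features: Sequence[float], y: float) -> None:
        """Account for the newly revealed observation y_t."""
        for i, u in enumerate(self.particles):
            self.cum_loss[i] += (y - self._clipped(u, features)) ** 2


forecaster = SeqSEWMonteCarlo(d=3, B=1.0, eta=1.0 / 8.0, tau=0.5, n_particles=500, seed=1)
stream = [([1.0, 0.0, 0.0], 0.8), ([0.0, 1.0, 0.0], -0.2), ([1.0, 1.0, 0.0], 0.6)]
predictions = []
for features, y in stream:
    predictions.append(forecaster.predict(features))
    forecaster.update(features, y)
```

Such a particle approximation degrades quickly as $d$ grows, so it should be read as a way to see the exponential weighting and clipping at work rather than as a practical implementation.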



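Proposition 1 Assume that, for a known constant $B_y > 0$, the $(x_1, y_1), \ldots, (x_T, y_T)$ are
such that $y_1, \ldots, y_T \in [-B_y, B_y]$. Then, for all $B \ge B_y$, all $\eta \le 1/(8B^2)$, and all $\tau > 0$, the
algorithm SeqSEW$^{B,\eta}_\tau$ satisfies

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\mathbf{u} \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 + \frac{4}{\eta} \|\mathbf{u}\|_0 \ln\!\left(1 + \frac{\|\mathbf{u}\|_1}{\|\mathbf{u}\|_0 \, \tau}\right) \right\} + \tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) . \tag{4}$$

Corollary 2 Assume that, for some known constants $B_y > 0$ and $B_\Phi > 0$, the
$(x_1, y_1), \ldots, (x_T, y_T)$ are such that $y_1, \ldots, y_T \in [-B_y, B_y]$ and $\sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le B_\Phi$.
Then, when used with $B = B_y$, $\eta = \dfrac{1}{8 B_y^2}$, and $\tau = \sqrt{\dfrac{16 B_y^2}{B_\Phi}}$, the algorithm SeqSEW$^{B,\eta}_\tau$
satisfies

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\mathbf{u} \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 + 32 B_y^2 \, \|\mathbf{u}\|_0 \ln\!\left(1 + \frac{\sqrt{B_\Phi} \, \|\mathbf{u}\|_1}{4 B_y \|\mathbf{u}\|_0}\right) \right\} + 16 B_y^2 . \tag{5}$$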



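Note that, if $\|\boldsymbol{\varphi}\|_\infty \triangleq \sup_{x \in \mathcal{X}} \max_{1 \le j \le d} |\varphi_j(x)|$ is finite, then the last corollary provides
a sparsity regret bound in the sense of (1). Indeed, in this case, we can take $B_\Phi = dT \|\boldsymbol{\varphi}\|_\infty^2$,
which yields a regret bound proportional to $\|\mathbf{u}\|_0$ and that grows logarithmically in $d$, $T$,
and $\|\boldsymbol{\varphi}\|_\infty$.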
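The arithmetic behind this tuning can be checked numerically. The sketch below assumes the tuning $\eta = 1/(8B_y^2)$ and $\tau = \sqrt{16 B_y^2 / B_\Phi}$ as we read it in Corollary 2, and evaluates the regret term of Proposition 1, namely $\frac{4}{\eta}\|\mathbf{u}\|_0 \ln\bigl(1 + \|\mathbf{u}\|_1/(\|\mathbf{u}\|_0 \tau)\bigr) + \tau^2 B_\Phi$. The function names and numerical values are hypothetical.

```python
import math
from typing import Sequence, Tuple


def corollary2_tuning(B_y: float, B_phi: float) -> Tuple[float, float]:
    """Return (eta, tau) with eta = 1/(8 B_y^2) and tau = sqrt(16 B_y^2 / B_phi)."""
    eta = 1.0 / (8.0 * B_y ** 2)
    tau = math.sqrt(16.0 * B_y ** 2 / B_phi)
    return eta, tau


def regret_term(u: Sequence[float], eta: float, tau: float, B_phi: float) -> float:
    """(4/eta) ||u||_0 ln(1 + ||u||_1 / (||u||_0 tau)) + tau^2 B_phi."""
    s = sum(1 for uj in u if uj != 0.0)      # ||u||_0
    l1 = sum(abs(uj) for uj in u)            # ||u||_1
    # Convention 0 ln(1 + U/0) = 0 when u is the null vector.
    sparsity = 0.0 if s == 0 else (4.0 / eta) * s * math.log(1.0 + l1 / (s * tau))
    return sparsity + tau ** 2 * B_phi


B_y, B_phi = 1.0, 16.0
eta, tau = corollary2_tuning(B_y, B_phi)
bound_null = regret_term([0.0, 0.0], eta, tau, B_phi)    # equals 16 B_y^2
bound_one = regret_term([1.0, 0.0], eta, tau, B_phi)     # 32 B_y^2 ln(2) + 16 B_y^2
```

With this tuning the additive term $\tau^2 B_\Phi$ does not depend on $B_\Phi$ any more, and the dependence on $B_\Phi$ is moved inside the logarithm.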

To prove Proposition 1, we first need the following deterministic PAC-Bayesian inequality,
which is at the core of our analysis. It is a straightforward consequence of Theorem 4.6
of Audibert (2009) when applied to the square loss. An adaptive variant of this inequality
will be provided in Section 3.2.



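Lemma 3 Assume that for some known constant $B_y > 0$, we have $y_1, \ldots, y_T \in [-B_y, B_y]$.
For all $\tau > 0$, if the algorithm SeqSEW$^{B,\eta}_\tau$ is used with $B \ge B_y$ and $\eta \le 1/(8B^2)$, then

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - [\mathbf{u} \cdot \boldsymbol{\varphi}(x_t)]_B\bigr)^2 \rho(d\mathbf{u}) + \frac{\mathcal{K}(\rho, \pi_\tau)}{\eta} \right\} \tag{6}$$

$$\le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 \rho(d\mathbf{u}) + \frac{\mathcal{K}(\rho, \pi_\tau)}{\eta} \right\} . \tag{7}$$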

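Proof (of Lemma 3) Inequality (6) is a straightforward consequence of Theorem 4.6 of
Audibert (2009) when applied to the square loss, the set of prediction functions
$\mathcal{G} \triangleq \bigl\{x \mapsto [\mathbf{u} \cdot \boldsymbol{\varphi}(x)]_B : \mathbf{u} \in \mathbb{R}^d\bigr\}$, and the prior$^3$ $\tilde{\pi}_\tau$ on $\mathcal{G}$ induced by the prior $\pi_\tau$ on $\mathbb{R}^d$ via the mapping
$\mathbf{u} \in \mathbb{R}^d \mapsto [\mathbf{u} \cdot \boldsymbol{\varphi}(\cdot)]_B \in \mathcal{G}$.

To apply the aforementioned theorem, recall from Cesa-Bianchi and Lugosi (2006, Section 3.3)
that the square loss is $1/(8B^2)$-exp-concave on $[-B, B]$ and thus $\eta$-exp-concave$^4$
(since $\eta \le 1/(8B^2)$ by assumption). Therefore, by Theorem 4.6 of Audibert (2009) with
the variance function $\delta_\eta \equiv 0$ (see the comments following Remark 4.1 therein), we get

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\mu \in \mathcal{M}_1^+(\mathcal{G})} \left\{ \int_{\mathcal{G}} \sum_{t=1}^T \bigl(y_t - g(x_t)\bigr)^2 \mu(dg) + \frac{\mathcal{K}(\mu, \tilde{\pi}_\tau)}{\eta} \right\}$$

$$\le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathcal{G}} \sum_{t=1}^T \bigl(y_t - g(x_t)\bigr)^2 \tilde{\rho}(dg) + \frac{\mathcal{K}(\tilde{\rho}, \tilde{\pi}_\tau)}{\eta} \right\} ,$$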

3. The set $\mathcal{G}$ is endowed with the $\sigma$-algebra generated by all the coordinate mappings $g \in \mathcal{G} \mapsto g(x) \in \mathbb{R}$,
$x \in \mathcal{X}$ (where $\mathbb{R}$ is endowed with its Borel $\sigma$-algebra).

4. This means that for all $y \in [-B, B]$, the function $x \mapsto \exp\bigl(-\eta (y - x)^2\bigr)$ is concave on $[-B, B]$.






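where the last inequality follows by restricting the infimum over $\mathcal{M}_1^+(\mathcal{G})$ to the subset
$\bigl\{\tilde{\rho} : \rho \in \mathcal{M}_1^+(\mathbb{R}^d)\bigr\} \subset \mathcal{M}_1^+(\mathcal{G})$, where $\tilde{\rho} \in \mathcal{M}_1^+(\mathcal{G})$ denotes the probability distribution
induced by $\rho \in \mathcal{M}_1^+(\mathbb{R}^d)$ via the mapping $\mathbf{u} \in \mathbb{R}^d \mapsto [\mathbf{u} \cdot \boldsymbol{\varphi}(\cdot)]_B \in \mathcal{G}$. Inequality (6) then
follows from the fact that for all $\rho \in \mathcal{M}_1^+(\mathbb{R}^d)$, we have $\mathcal{K}(\tilde{\rho}, \tilde{\pi}_\tau) \le \mathcal{K}(\rho, \pi_\tau)$ by joint convexity
of $\mathcal{K}(\cdot, \cdot)$.

As for Inequality (7), it follows from (6) by noting that

$$\forall y \in [-B, B], \quad \forall x \in \mathbb{R}, \qquad \bigl|y - [x]_B\bigr| \le |y - x| .$$

Therefore, truncation to $[-B, B]$ can only improve prediction under the square loss if the
observations are $[-B, B]$-valued, which is the case here since by assumption $y_t \in [-B_y, B_y] \subset
[-B, B]$ for all $t = 1, \ldots, T$. ■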



Remark 4 As can be seen from the previous proof, if the prior $\pi_\tau$ used to define the algorithm
SeqSEW was replaced with any prior $\pi \in \mathcal{M}_1^+(\mathbb{R}^d)$, then Lemma 3 would still hold
true with $\pi$ instead of $\pi_\tau$. This fact is natural from a PAC-Bayesian perspective (see, e.g.,
Catoni, 2004; Dalalyan and Tsybakov, 2008). We only (but crucially) use the particular
shape of the sparsity-favoring prior $\pi_\tau$ to derive Proposition 1 from the PAC-Bayesian
bound (7).

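Proof (of Proposition 1) Our proof mimics the proof of Theorem 5 by Dalalyan and
Tsybakov (2008). We thus only write the outline of the proof and stress the minor changes
that are needed to derive Inequality (4). The key technical tools provided by Dalalyan and
Tsybakov (2008) are reproduced in Appendix B.2 for the convenience of the reader.

Let $\mathbf{u}^* \in \mathbb{R}^d$. Since $B \ge B_y$ and $\eta \le 1/(8B^2)$, we can apply Lemma 3 and get

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 \rho(d\mathbf{u}) + \frac{\mathcal{K}(\rho, \pi_\tau)}{\eta} \right\}$$

$$\le \underbrace{\int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 \rho_{\mathbf{u}^*,\tau}(d\mathbf{u})}_{(1)} + \underbrace{\frac{\mathcal{K}(\rho_{\mathbf{u}^*,\tau}, \pi_\tau)}{\eta}}_{(2)} . \tag{8}$$

In the last inequality, $\rho_{\mathbf{u}^*,\tau}$ is taken as the translated of $\pi_\tau$ at $\mathbf{u}^*$, namely,

$$\rho_{\mathbf{u}^*,\tau}(d\mathbf{u}) \triangleq \frac{d\pi_\tau}{d\mathbf{u}}(\mathbf{u} - \mathbf{u}^*) \, d\mathbf{u} = \prod_{j=1}^d \frac{(3/\tau) \, du_j}{2 \bigl(1 + |u_j - u_j^*|/\tau\bigr)^4} .$$

The two terms (1) and (2) can be upper bounded as in the proof of Theorem 5 by Dalalyan
and Tsybakov (2008). By a symmetry argument recalled in Lemma 22 (Appendix B.2), the
first term (1) can be rewritten as

$$\int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 \rho_{\mathbf{u}^*,\tau}(d\mathbf{u}) = \sum_{t=1}^T \bigl(y_t - \mathbf{u}^* \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 + \tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) . \tag{9}$$

As for the term (2), we have, as is recalled in Lemma 23,

$$\frac{\mathcal{K}(\rho_{\mathbf{u}^*,\tau}, \pi_\tau)}{\eta} \le \frac{4}{\eta} \|\mathbf{u}^*\|_0 \ln\!\left(1 + \frac{\|\mathbf{u}^*\|_1}{\|\mathbf{u}^*\|_0 \, \tau}\right) . \tag{10}$$

Combining (8), (9), and (10), which all hold for all $\mathbf{u}^* \in \mathbb{R}^d$, we get Inequality (4). ■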
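Proof (of Corollary 2) Applying Proposition 1, we have, since $B \ge B_y$ and $\eta \le 1/(8B^2)$,

$$\sum_{t=1}^T (y_t - \hat{y}_t)^2 \le \inf_{\mathbf{u} \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \bigl(y_t - \mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr)^2 + \frac{4}{\eta} \|\mathbf{u}\|_0 \ln\!\left(1 + \frac{\|\mathbf{u}\|_1}{\|\mathbf{u}\|_0 \, \tau}\right) \right\} + \tau^2 B_\Phi , \tag{11}$$

since $\sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le B_\Phi$ by assumption. The particular (and nearly optimal) choices
of $\eta$ and $\tau$ given in the statement of the corollary then yield the desired inequality (5). ■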

We end this subsection with a natural question about approximate sparsity: Proposition 1
ensures a low regret with respect to sparse linear combinations $\mathbf{u} \in \mathbb{R}^d$, but what
can be said for approximately sparse linear combinations, i.e., vectors $\mathbf{u} \in \mathbb{R}^d$ that are very
close to a sparse vector? As can be seen from the proof of Lemma 23 in Appendix B.2, the
sparsity-related term

$$\frac{4}{\eta} \|\mathbf{u}\|_0 \ln\!\left(1 + \frac{\|\mathbf{u}\|_1}{\|\mathbf{u}\|_0 \, \tau}\right)$$

in the regret bound of Proposition 1 can actually be replaced with the smaller (and continuous)
term

$$\frac{4}{\eta} \sum_{j=1}^d \ln\bigl(1 + |u_j|/\tau\bigr) .$$

The last term is always smaller than the former and guarantees that the regret is small with
respect to any approximately sparse vector $\mathbf{u} \in \mathbb{R}^d$.
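The comparison between the two terms can be illustrated numerically. In the sketch below (function names and numerical values are ours), perturbing a 1-sparse vector by tiny amounts on its zero coordinates makes the term based on $\|\mathbf{u}\|_0$ jump, while the continuous term barely moves; the inequality between the two terms follows from the concavity of the logarithm.

```python
import math
from typing import Sequence


def sparse_term(u: Sequence[float], eta: float, tau: float) -> float:
    """(4/eta) ||u||_0 ln(1 + ||u||_1 / (||u||_0 tau)), with value 0 for u = 0."""
    s = sum(1 for uj in u if uj != 0.0)
    if s == 0:
        return 0.0
    l1 = sum(abs(uj) for uj in u)
    return (4.0 / eta) * s * math.log(1.0 + l1 / (s * tau))


def continuous_term(u: Sequence[float], eta: float, tau: float) -> float:
    """(4/eta) sum_j ln(1 + |u_j| / tau)."""
    return (4.0 / eta) * sum(math.log1p(abs(uj) / tau) for uj in u)


eta, tau = 1.0 / 8.0, 1.0
u_sparse = [2.0, 0.0, 0.0, 0.0]
u_approx = [2.0, 1e-6, 1e-6, 1e-6]   # approximately sparse: close to u_sparse

jump = sparse_term(u_approx, eta, tau) - sparse_term(u_sparse, eta, tau)
drift = continuous_term(u_approx, eta, tau) - continuous_term(u_sparse, eta, tau)
```

Here `jump` is large because $\|\mathbf{u}\|_0$ goes from 1 to 4, whereas `drift` stays close to zero.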



3.2 Unknown bound $B_y$ on the observations but known bound $B_\Phi$ on the trace
of the empirical Gram matrix

In the previous section, to prove the upper bounds stated in Lemma 3 and Proposition 1,
we assumed that the forecaster had access to a bound $B_y$ on the observations $|y_t|$ and
to a bound $B_\Phi$ on the trace of the empirical Gram matrix. In this section, we remove
the first requirement and prove a sparsity regret bound for a variant of the algorithm
SeqSEW$^{B,\eta}_\tau$ which is adaptive to the unknown bound $B_y = \max_{1 \le t \le T} |y_t|$; see Proposition 5
and Remark 6 below.

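For this purpose we consider the algorithm of Figure 3, which we call SeqSEW$^*_\tau$ thereafter.
It differs from SeqSEW$^{B,\eta}_\tau$ defined in the previous section in that the threshold $B$
and the inverse temperature $\eta$ are now allowed to vary over time and are chosen at each
time round as a function of the data available to the forecaster.

Parameter: prior scale $\tau > 0$ with which we associate the sparsity prior $\pi_\tau \in \mathcal{M}_1^+(\mathbb{R}^d)$
defined by

$$\pi_\tau(d\mathbf{u}) \triangleq \prod_{j=1}^d \frac{(3/\tau)\, du_j}{2 \bigl(1 + |u_j|/\tau\bigr)^4} .$$

Initialization: $B_1 \triangleq 0$, $\eta_1 \triangleq +\infty$, and $p_1 \triangleq \pi_\tau$.

At each time round $t \ge 1$,

1. Get the input data $x_t$ and predict$^a$ as $\hat{y}_t \triangleq \int_{\mathbb{R}^d} \bigl[\mathbf{u} \cdot \boldsymbol{\varphi}(x_t)\bigr]_{B_t} \, p_t(d\mathbf{u})$;

2. Get the observation $y_t$ and update:

• the threshold $B_{t+1} \triangleq \bigl(2^{\lceil \log_2 \max_{1 \le s \le t} y_s^2 \rceil}\bigr)^{1/2}$,

• the inverse temperature $\eta_{t+1} \triangleq 1/(8 B_{t+1}^2)$,

• and the posterior distribution $p_{t+1} \in \mathcal{M}_1^+(\mathbb{R}^d)$ as

$$p_{t+1}(d\mathbf{u}) \triangleq \frac{\exp\Bigl(-\eta_{t+1} \sum_{s=1}^t \bigl(y_s - [\mathbf{u} \cdot \boldsymbol{\varphi}(x_s)]_{B_s}\bigr)^2\Bigr)}{W_{t+1}} \, \pi_\tau(d\mathbf{u}) ,$$

where

$$W_{t+1} \triangleq \int_{\mathbb{R}^d} \exp\Bigl(-\eta_{t+1} \sum_{s=1}^t \bigl(y_s - [\mathbf{v} \cdot \boldsymbol{\varphi}(x_s)]_{B_s}\bigr)^2\Bigr) \, \pi_\tau(d\mathbf{v}) .$$

a. The clipping operator $[\cdot]_B$ is defined in Section 2.

Figure 3: The algorithm SeqSEW$^*_\tau$.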

The idea of truncating the base forecasts has been used many times in the past; see, e.g., the work of Vovk (2001) in the online linear regression setting, that of Györfi et al. (2002, Chapter 10) for the regression problem with random design, and the papers of Györfi and Ottucsák (2007) and Biau et al. (2010) for sequential prediction of unbounded time series under the square loss. A key ingredient in the present paper is to perform truncation with respect to a data-driven threshold. The online tuning of this threshold is based on a pseudo-doubling-trick technique provided by Cesa-Bianchi et al. (2007). (We use the prefix pseudo since the algorithm does not restart at the beginning of each new regime.)
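The data-driven tuning of the threshold and of the inverse temperature can be sketched in a few lines. This is only an illustration of the update rules $B_{t+1}^2 = 2^{\lceil \log_2 \max_{s \le t} y_s^2 \rceil}$ and $\eta_{t+1} = 1/(8B_{t+1}^2)$ of Figure 3; the function names are hypothetical, the convention $B_{t+1} = 0$ while all observations are zero is an assumption, and the posterior update (an integral over $\mathbb{R}^d$) is deliberately left out.

```python
import math

def next_threshold_sq(max_y_sq):
    """B_{t+1}^2 = 2^ceil(log2(max_{s<=t} y_s^2)); assumed to be 0 while all y_s are 0."""
    if max_y_sq == 0.0:
        return 0.0
    return 2.0 ** math.ceil(math.log2(max_y_sq))

def thresholds(ys):
    """Return [(B_t^2, eta_t)] for t = 1, ..., T+1, starting from B_1 = 0 and eta_1 = +inf."""
    out = [(0.0, math.inf)]
    running_max = 0.0
    for y in ys:
        running_max = max(running_max, y * y)
        b_sq = next_threshold_sq(running_max)
        eta = math.inf if b_sq == 0.0 else 1.0 / (8.0 * b_sq)
        out.append((b_sq, eta))
    return out

def clip(x, b):
    """Clipping operator [x]_b: project x onto the interval [-b, b]."""
    return max(-b, min(b, x))

print(thresholds([0.5, 3.0, 1.0, -5.0]))
```

The threshold only moves when a new observation exceeds the current power of two, so $B_t$ is piecewise constant and $\eta_t$ is nonincreasing, which is what the proof of Lemma 9 relies on.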
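Proposition 5 For all $\tau > 0$, the algorithm SeqSEW$^*_\tau$ satisfies

$$\sum_{t=1}^T (y_t - \hat y_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 + 32 B_{T+1}^2\, \|u\|_0 \ln\!\left(1 + \frac{\|u\|_1}{\|u\|_0\, \tau}\right) \right\} + \tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) + 16 B_{T+1}^2, \tag{12}$$

where $B_{T+1}^2 \triangleq 2^{\lceil \log_2 \max_{1 \le t \le T} y_t^2 \rceil} \le 2 \max_{1 \le t \le T} y_t^2$.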

Remark 6 In view of Proposition 1, the algorithm SeqSEW$^*_\tau$ satisfies a sparsity regret bound which is adaptive to the unknown bound $B_y = \max_{1 \le t \le T} |y_t|$. The price for the automatic tuning with respect to $B_y$ consists only of a multiplicative factor smaller than 2 and the additive factor $16 B_{T+1}^2$, which is smaller than $32 B_y^2$.

As in the previous section, several corollaries can be derived from Proposition 5. If the forecaster has access beforehand to a quantity $B_\Phi > 0$ such that $\sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le B_\Phi$, then a suboptimal but reasonable choice of $\tau$ is given by $\tau = 1/\sqrt{B_\Phi}$; see Corollary 7 below. The simpler tuning$^5$ $\tau = 1/\sqrt{dT}$ of Corollary 8 will be useful in the stochastic batch setting (cf. Section 4). The proofs of the next corollaries are immediate.
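For completeness, the substitutions behind the two corollaries can be written out as follows. This is a short sketch that only plugs the two choices of $\tau$ into inequality (12) of Proposition 5.

```latex
% Choice tau = 1/sqrt(B_Phi), assuming sum_j sum_t phi_j^2(x_t) <= B_Phi:
\tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le \frac{B_\Phi}{B_\Phi} = 1,
\qquad
\ln\!\left(1 + \frac{\|u\|_1}{\|u\|_0\,\tau}\right)
  = \ln\!\left(1 + \frac{\sqrt{B_\Phi}\,\|u\|_1}{\|u\|_0}\right).

% Choice tau = 1/sqrt(dT), with no assumption on the trace:
\tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t)
  = \frac{1}{dT} \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t),
\qquad
\ln\!\left(1 + \frac{\|u\|_1}{\|u\|_0\,\tau}\right)
  = \ln\!\left(1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0}\right).
```

In both cases the remaining terms of (12), namely the infimum and the additive term $16 B_{T+1}^2$, are left unchanged.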

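Corollary 7 Assume that, for a known constant $B_\Phi > 0$, the $(x_1, y_1), \dots, (x_T, y_T)$ are such that $\sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) \le B_\Phi$. Then, when used with $\tau = 1/\sqrt{B_\Phi}$, the algorithm SeqSEW$^*_\tau$ satisfies

$$\sum_{t=1}^T (y_t - \hat y_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 + 32 B_{T+1}^2\, \|u\|_0 \ln\!\left(1 + \frac{\sqrt{B_\Phi}\, \|u\|_1}{\|u\|_0}\right) \right\} + 16 B_{T+1}^2 + 1, \tag{13}$$

where $B_{T+1}^2 \triangleq 2^{\lceil \log_2 \max_{1 \le t \le T} y_t^2 \rceil} \le 2 \max_{1 \le t \le T} y_t^2$.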



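Corollary 8 Assume that $T$ is known to the forecaster at the beginning of the prediction game. Then, when used with $\tau = 1/\sqrt{dT}$, the algorithm SeqSEW$^*_\tau$ satisfies

$$\sum_{t=1}^T (y_t - \hat y_t)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 + 32 B_{T+1}^2\, \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\, \|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) + 16 B_{T+1}^2, \tag{14}$$

where $B_{T+1}^2 \triangleq 2^{\lceil \log_2 \max_{1 \le t \le T} y_t^2 \rceil} \le 2 \max_{1 \le t \le T} y_t^2$.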



5. The tuning $\tau = 1/\sqrt{dT}$ only uses the knowledge of $T$, which is known by the forecaster in the stochastic batch setting. In that framework, another simple and easy-to-analyse tuning is given by $\tau = 1/(\|\varphi\|_\infty \sqrt{dT})$, which corresponds to $B_\Phi = dT \|\varphi\|_\infty^2$, but it requires that $\|\varphi\|_\infty \triangleq \sup_{x \in \mathcal{X}} \max_{1 \le j \le d} |\varphi_j(x)|$ be finite. Note that the last tuning satisfies the scale-invariant property pointed out by Dalalyan and Tsybakov (2011, Remark 4).
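As in the previous section, to prove Proposition 5, we first need a key PAC-Bayesian inequality. The next lemma is an adaptive variant of Lemma 3.

Lemma 9 For all $\tau > 0$, the algorithm SeqSEW$^*_\tau$ satisfies

$$\sum_{t=1}^T (y_t - \hat y_t)^2 \le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - [u \cdot \varphi(x_t)]_{B_t}\bigr)^2 \rho(du) + 8 B_{T+1}^2\, \mathcal{K}(\rho, \pi_\tau) \right\} + 8 B_{T+1}^2 \tag{15}$$

$$\le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 \rho(du) + 8 B_{T+1}^2\, \mathcal{K}(\rho, \pi_\tau) \right\} + 16 B_{T+1}^2, \tag{16}$$

where $B_{T+1}^2 \triangleq 2^{\lceil \log_2 \max_{1 \le t \le T} y_t^2 \rceil} \le 2 \max_{1 \le t \le T} y_t^2$.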



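Proof (of Lemma 9) The proof is based on arguments that are similar to those underlying Lemma 3, except that we now need to deal with $B$ and $\eta$ changing over time. In the same spirit as in Auer et al. (2002); Cesa-Bianchi et al. (2007); Györfi and Ottucsák (2007), our analysis relies on the control of $(\ln W_{t+1})/\eta_{t+1} - (\ln W_t)/\eta_t$, where $W_1 \triangleq 1$ and, for all $t \ge 2$,

$$W_t \triangleq \int_{\mathbb{R}^d} \exp\Bigl(-\eta_t \sum_{s=1}^{t-1} \bigl(y_s - [u \cdot \varphi(x_s)]_{B_s}\bigr)^2\Bigr)\, \pi_\tau(du).$$

On the one hand, we have

$$\frac{\ln W_{T+1}}{\eta_{T+1}} - \frac{\ln W_1}{\eta_1} = \frac{1}{\eta_{T+1}} \ln \int_{\mathbb{R}^d} \exp\Bigl(-\eta_{T+1} \sum_{t=1}^T \bigl(y_t - [u \cdot \varphi(x_t)]_{B_t}\bigr)^2\Bigr)\, \pi_\tau(du) - \frac{1}{\eta_1} \ln 1$$

$$= -\inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - [u \cdot \varphi(x_t)]_{B_t}\bigr)^2 \rho(du) + \frac{\mathcal{K}(\rho, \pi_\tau)}{\eta_{T+1}} \right\}, \tag{17}$$

where the last equality follows from a convex duality argument for the Kullback-Leibler divergence (cf., e.g., Catoni 2004, p. 159) which we recall in Proposition 21 in Appendix B.1.

On the other hand, we can rewrite $(\ln W_{T+1})/\eta_{T+1} - (\ln W_1)/\eta_1$ as a telescopic sum and get

$$\frac{\ln W_{T+1}}{\eta_{T+1}} - \frac{\ln W_1}{\eta_1} = \sum_{t=1}^T \left( \frac{\ln W_{t+1}}{\eta_{t+1}} - \frac{\ln W_t}{\eta_t} \right) = \sum_{t=1}^T \Biggl( \underbrace{\frac{\ln W_{t+1}}{\eta_{t+1}} - \frac{\ln W'_{t+1}}{\eta_t}}_{(1)} + \underbrace{\frac{1}{\eta_t} \ln \frac{W'_{t+1}}{W_t}}_{(2)} \Biggr), \tag{18}$$

where $W'_{t+1}$ is obtained from $W_{t+1}$ by replacing $\eta_{t+1}$ with $\eta_t$; namely,

$$W'_{t+1} \triangleq \int_{\mathbb{R}^d} \exp\Bigl(-\eta_t \sum_{s=1}^{t} \bigl(y_s - [u \cdot \varphi(x_s)]_{B_s}\bigr)^2\Bigr)\, \pi_\tau(du).$$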
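Let $t \in \{1, \dots, T\}$. The first term (1) is non-positive by Jensen's inequality (note that $x \mapsto x^{\eta_{t+1}/\eta_t}$ is concave on $\mathbb{R}_+$ since $\eta_{t+1} \le \eta_t$ by construction). As for the second term (2), by definition of $W'_{t+1}$,

$$\frac{1}{\eta_t} \ln \frac{W'_{t+1}}{W_t} = \frac{1}{\eta_t} \ln \int_{\mathbb{R}^d} \exp\Bigl(-\eta_t \bigl(y_t - [u \cdot \varphi(x_t)]_{B_t}\bigr)^2\Bigr)\, \frac{\exp\Bigl(-\eta_t \sum_{s=1}^{t-1} \bigl(y_s - [u \cdot \varphi(x_s)]_{B_s}\bigr)^2\Bigr)}{W_t}\, \pi_\tau(du)$$

$$= \frac{1}{\eta_t} \ln \int_{\mathbb{R}^d} \exp\Bigl(-\eta_t \bigl(y_t - [u \cdot \varphi(x_t)]_{B_t}\bigr)^2\Bigr)\, p_t(du) \tag{19}$$

$$\le \begin{cases} -(y_t - \hat y_t)^2 & \text{if } B_{t+1} = B_t; \\ -(y_t - \hat y_t)^2 + (2 B_{t+1})^2 & \text{if } B_{t+1} > B_t; \end{cases} \tag{20}$$

where (19) follows by definition of $p_t$. To get Inequality (20) when $B_{t+1} = B_t$, or, equivalently, $|y_t| \le B_t$, we used the fact that the square loss is $1/(8 B_t^2)$-exp-concave on $[-B_t, B_t]$ (as in Lemma 3). Indeed, by definition of $\eta_t \triangleq 1/(8 B_t^2)$ and by Jensen's inequality, we get

$$\int_{\mathbb{R}^d} e^{-\eta_t (y_t - [u \cdot \varphi(x_t)]_{B_t})^2}\, p_t(du) \le \exp\Bigl(-\eta_t \Bigl(y_t - \int_{\mathbb{R}^d} [u \cdot \varphi(x_t)]_{B_t}\, p_t(du)\Bigr)^2\Bigr) = e^{-\eta_t (y_t - \hat y_t)^2},$$

where the last equality follows by definition of $\hat y_t$. Taking the logarithms of both sides of the last inequality and dividing by $\eta_t$, we get (20) when $B_{t+1} = B_t$.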

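As for the rounds $t$ such that $B_{t+1} > B_t$, the square loss $x \mapsto (y_t - x)^2$ is no longer $1/(8 B_t^2)$-exp-concave on $[-B_t, B_t]$. In this case (20) follows from the cruder upper bound $(1/\eta_t) \ln(W'_{t+1}/W_t) \le 0 \le -(y_t - \hat y_t)^2 + (2 B_{t+1})^2$ (since $|y_t|, |\hat y_t| \le B_{t+1}$). Summing (20) over $t = 1, \dots, T$, Equation (18) yields

$$\frac{\ln W_{T+1}}{\eta_{T+1}} - \frac{\ln W_1}{\eta_1} \le -\sum_{t=1}^T (y_t - \hat y_t)^2 + 4 \sum_{t : B_{t+1} > B_t} B_{t+1}^2 \le -\sum_{t=1}^T (y_t - \hat y_t)^2 + 8 B_{T+1}^2, \tag{21}$$

where, setting $K \triangleq \lceil \log_2 \max_{1 \le t \le T} y_t^2 \rceil$, we bounded the geometric sum $\sum_{t : B_{t+1} > B_t} B_{t+1}^2$ from above by $\sum_{k=-\infty}^{K} 2^k = 2^{K+1} = 2 B_{T+1}^2$ in the same way as in Theorem 6 of Cesa-Bianchi et al. (2007).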
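The geometric-sum argument can be illustrated on simulated observations. The following sketch (hypothetical function names, Gaussian observations chosen only for the example) accumulates $B_{t+1}^2$ over the rounds where the threshold increases and compares the result with $2 B_{T+1}^2$.

```python
import math
import random

def threshold_sq(max_y_sq):
    """2^ceil(log2(max y^2)), with the assumed convention 0 when max y^2 = 0."""
    return 0.0 if max_y_sq == 0.0 else 2.0 ** math.ceil(math.log2(max_y_sq))

def sum_over_increases(ys):
    """Return (sum of B_{t+1}^2 over rounds with B_{t+1} > B_t, B_{T+1}^2)."""
    total, b_sq, running_max = 0.0, 0.0, 0.0
    for y in ys:
        running_max = max(running_max, y * y)
        new_b_sq = threshold_sq(running_max)
        if new_b_sq > b_sq:
            total += new_b_sq
        b_sq = new_b_sq
    return total, b_sq

rng = random.Random(0)
ys = [rng.gauss(0.0, 3.0) for _ in range(50)]
total, b_sq = sum_over_increases(ys)
print(total, 2.0 * b_sq)
```

Since the successive values of $B_{t+1}^2$ at the increase rounds are distinct powers of two not exceeding $2^K$, their sum is at most $2^{K+1}$, which is what the comparison displays.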

Putting Equations (17) and (21) together, we get the PAC-Bayesian inequality

$$\sum_{t=1}^T (y_t - \hat y_t)^2 \le \inf_{\rho \in \mathcal{M}_1^+(\mathbb{R}^d)} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - [u \cdot \varphi(x_t)]_{B_t}\bigr)^2 \rho(du) + \frac{\mathcal{K}(\rho, \pi_\tau)}{\eta_{T+1}} \right\} + 8 B_{T+1}^2,$$

which yields (15) by definition of $\eta_{T+1} \triangleq 1/(8 B_{T+1}^2)$. The other PAC-Bayesian inequality (16), which is stated for non-truncated base forecasts, follows from (15) by the fact that truncation to $B_t$ can only improve prediction if $|y_t| \le B_t$. The remaining $t$'s such that $|y_t| > B_t$ then just account for an overall additional term at most equal to $\sum_{t : B_{t+1} > B_t} 4 B_{t+1}^2 \le 8 B_{T+1}^2$, which concludes the proof. ■

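Proof (of Proposition 5) The proof follows the exact same lines as in Proposition 1 except that we apply Lemma 9 instead of Lemma 3. Indeed, using Lemma 9 and restricting the infimum to the $\rho_{u^*, \tau}$, $u^* \in \mathbb{R}^d$ (cf. (41)), we get that

$$\sum_{t=1}^T (y_t - \hat y_t)^2 \le \inf_{u^* \in \mathbb{R}^d} \left\{ \int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 \rho_{u^*, \tau}(du) + 8 B_{T+1}^2\, \mathcal{K}(\rho_{u^*, \tau}, \pi_\tau) \right\} + 16 B_{T+1}^2$$

$$\le \inf_{u^* \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \bigl(y_t - u^* \cdot \varphi(x_t)\bigr)^2 + 32 B_{T+1}^2\, \|u^*\|_0 \ln\!\left(1 + \frac{\|u^*\|_1}{\|u^*\|_0\, \tau}\right) \right\} + \tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) + 16 B_{T+1}^2,$$

where the last inequality follows from Lemmas 22 and 23. ■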

3.3 A fully automatic algorithm

In the previous section, we proved that adaptation to $B_y$ was possible. If we also no longer assume that a bound $B_\Phi$ on the trace of the empirical Gram matrix is available to the forecaster, then we can use a doubling trick on the nondecreasing quantity

$$\gamma_t \triangleq \ln\Biggl(1 + \sqrt{\sum_{s=1}^t \sum_{j=1}^d \varphi_j^2(x_s)}\Biggr)$$

and repeatedly run the algorithm SeqSEW$^*_\tau$ of the previous section for rapidly-decreasing values of $\tau$. This yields a sparsity regret bound with extra logarithmic multiplicative factors as compared to Proposition 5, but which holds for a fully automatic algorithm; see Theorem 10 below.

More formally, our algorithm SeqSEW$^*_*$ is defined as follows. The set of all time rounds $t = 1, 2, \dots$ is partitioned into regimes $r = 0, 1, \dots$ whose final time instances $t_r$ are data-driven. Let $t_{-1} \triangleq 0$ by convention. We call regime $r$, $r = 0, 1, \dots$, the sequence of time rounds $(t_{r-1} + 1, \dots, t_r)$ where $t_r$ is the first date $t \ge t_{r-1} + 1$ such that $\gamma_t > 2^r$. At the beginning of regime $r$, we restart the algorithm SeqSEW$^*_\tau$ defined in Figure 3 with the parameter $\tau$ set to $\tau_r \triangleq 1/\bigl(\exp(2^r) - 1\bigr)$.
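The regime construction can be sketched as follows. This is a minimal illustration of the doubling rule only, under the reading that $\gamma_t = \ln\bigl(1 + \sqrt{\sum_{s \le t} \sum_j \varphi_j^2(x_s)}\bigr)$ and that regime $r$ ends at the first round where $\gamma_t > 2^r$; the function name `regime_schedule` and the input layout are hypothetical, and the restarts of the underlying algorithm are not simulated.

```python
import math

def regime_schedule(phi_rows):
    """phi_rows[t-1] holds (phi_1(x_t), ..., phi_d(x_t)).

    Returns, for each round t, the pair (r, tau_r): the index of the regime that
    round t belongs to and the prior scale tau_r = 1/(exp(2^r) - 1) used in it.
    """
    schedule = []
    r, trace = 0, 0.0
    for row in phi_rows:
        trace += sum(v * v for v in row)            # cumulative trace of the Gram matrix
        gamma = math.log1p(math.sqrt(trace))        # gamma_t
        schedule.append((r, 1.0 / math.expm1(2.0 ** r)))
        if gamma > 2.0 ** r:                        # round t is the last round t_r of regime r
            r += 1
    return schedule

# Toy example with d = 1 and phi_1(x_t) = 1 for all t, so that gamma_t = ln(1 + sqrt(t)).
sched = regime_schedule([[1.0]] * 50)
print([r for r, _ in sched])
```

In this toy example regime 0 covers rounds 1 to 3 and regime 1 covers rounds 4 to 41, reflecting how quickly the regime lengths grow when the trace grows linearly in $t$.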

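Theorem 10 Without requiring any preliminary knowledge at the beginning of the prediction game, SeqSEW$^*_*$ satisfies, for all $T \ge 1$ and all $(x_1, y_1), \dots, (x_T, y_T) \in \mathcal{X} \times \mathbb{R}$,

$$\sum_{t=1}^T (y_t - \hat y_t)^2 \le \inf_{u \in \mathbb{R}^d} \Biggl\{ \sum_{t=1}^T \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 + 256 \Bigl(\max_{1 \le t \le T} y_t^2\Bigr) \|u\|_0 \ln\Biggl(e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)}\Biggr)$$

$$\qquad + 64 \Bigl(\max_{1 \le t \le T} y_t^2\Bigr) A_T\, \|u\|_0 \ln\!\left(1 + \frac{\|u\|_1}{\|u\|_0}\right) \Biggr\} + \Bigl(1 + 38 \max_{1 \le t \le T} y_t^2\Bigr) A_T,$$

where $A_T \triangleq 2 + \log_2 \ln\Bigl(e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)}\Bigr)$.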






On each regime $r$, the current instance of the algorithm SeqSEW$^*_{\tau_r}$ only uses the past observations $y_s$, $s \in \{t_{r-1} + 1, \dots, t - 1\}$, to perform the online truncation and to tune the inverse temperature parameter. Therefore, the algorithm SeqSEW$^*_*$ is fully automatic.

Note however that two possible improvements could be addressed in the future. From a theoretical viewpoint, can we construct a fully automatic algorithm with a bound similar to Theorem 10 but without the extra logarithmic factor $A_T$? From a practical viewpoint, is it possible to perform the adaptation to $B_\Phi$ without restarting the algorithm repeatedly (just like we did for $B_y$)? A smoother time-varying tuning $(\tau_t)_{t \ge 2}$ might make it possible to answer both questions. This would very probably come at the price of a more involved analysis (e.g., if we adapt the PAC-Bayesian bound of Lemma 9, then a third approximation term would appear in (18) since $\pi_{\tau_t}$ changes over time).

Proof sketch (of Theorem 10) The proof relies on the application of Proposition 5 with $\tau = \tau_r$ on all regimes $r$ visited up to time $T$. Summing the corresponding inequalities over $r$ then concludes the proof. See Appendix A.1 for a detailed proof. ■

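Theorem 10 yields the following corollary. It upper bounds the regret of the algorithm SeqSEW$^*_*$ uniformly over all $u \in \mathbb{R}^d$ such that $\|u\|_0 \le s$ and $\|u\|_1 \le U$, where the sparsity level $s \in \mathbb{N}$ and the $\ell^1$-diameter $U > 0$ are both unknown to the forecaster. The proof is postponed to Appendix A.1.

Corollary 11 Fix $s \in \mathbb{N}$ and $U > 0$. Then, for all $T \ge 1$ and all $(x_1, y_1), \dots, (x_T, y_T) \in \mathcal{X} \times \mathbb{R}$, the regret of the algorithm SeqSEW$^*_*$ on $\{u \in \mathbb{R}^d : \|u\|_0 \le s \text{ and } \|u\|_1 \le U\}$ is bounded by

$$\sum_{t=1}^T (y_t - \hat y_t)^2 - \inf_{\substack{\|u\|_0 \le s \\ \|u\|_1 \le U}} \sum_{t=1}^T \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 \le 256 \Bigl(\max_{1 \le t \le T} y_t^2\Bigr)\, s \ln\Biggl(e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)}\Biggr)$$

$$\qquad + 64 \Bigl(\max_{1 \le t \le T} y_t^2\Bigr) A_T\, s \ln\!\left(1 + \frac{U}{s}\right) + \Bigl(1 + 38 \max_{1 \le t \le T} y_t^2\Bigr) A_T,$$

where $A_T \triangleq 2 + \log_2 \ln\Bigl(e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)}\Bigr)$.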

4. Adaptivity to the unknown variance in the stochastic setting 

In this section, we apply the online algorithm SeqSEW$^*_\tau$ of Section 3.2 to two related stochastic settings: the regression model with random design (Section 4.1) and the regression model with fixed design (Section 4.2). The sparsity regret bounds proved for this algorithm on individual sequences imply in both settings sparsity oracle inequalities with leading constant 1. These risk bounds are of the same flavor as in Dalalyan and Tsybakov (2008, 2011) but they are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. In particular, we solve two questions left open by Dalalyan and Tsybakov (2011) in the random design case.

In the sequel, just like in the online deterministic setting, we assume that the forecaster has access to a dictionary $\varphi = (\varphi_1, \dots, \varphi_d)$ of measurable base regressors $\varphi_j : \mathcal{X} \to \mathbb{R}$, $j = 1, \dots, d$.

4.1 Regression model with random design 

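In this section we apply the algorithm SeqSEW$^*_\tau$ to the regression model with random design. In this batch setting the forecaster is given at the beginning of the game $T$ independent random copies $(X_1, Y_1), \dots, (X_T, Y_T)$ of $(X, Y) \in \mathcal{X} \times \mathbb{R}$ whose common distribution is unknown. We assume thereafter that $\mathbb{E}[Y^2] < \infty$; the goal of the forecaster is to estimate the regression function $f : \mathcal{X} \to \mathbb{R}$ defined by $f(x) \triangleq \mathbb{E}[Y \mid X = x]$ for all $x \in \mathcal{X}$. Setting $\varepsilon_t \triangleq Y_t - f(X_t)$ for all $t = 1, \dots, T$, note that

$$Y_t = f(X_t) + \varepsilon_t, \qquad 1 \le t \le T,$$

and that the pairs $(X_1, \varepsilon_1), \dots, (X_T, \varepsilon_T)$ are i.i.d. and such that $\mathbb{E}[\varepsilon_1^2] < \infty$ and $\mathbb{E}[\varepsilon_1 \mid X_1] = 0$ almost surely. In the sequel, we denote the distribution of $X$ by $P^X$ and we set, for all measurable functions $h : \mathcal{X} \to \mathbb{R}$,

$$\|h\|_{L^2} \triangleq \left( \int_{\mathcal{X}} h(x)^2\, P^X(dx) \right)^{1/2} = \bigl(\mathbb{E}[h(X)^2]\bigr)^{1/2}.$$

Next we construct a regressor $\hat f_T : \mathcal{X} \to \mathbb{R}$ based on the sample $(X_1, Y_1), \dots, (X_T, Y_T)$ that satisfies a sparsity oracle inequality, i.e., its expected $L^2$-risk $\mathbb{E}\bigl[\|f - \hat f_T\|_{L^2}^2\bigr]$ is almost as small as the smallest $L^2$-risk $\|f - u \cdot \varphi\|_{L^2}^2$, $u \in \mathbb{R}^d$, up to some additive term proportional to $\|u\|_0$.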

4.1.1 Algorithm and main result 

Even if the whole sample $(X_1, Y_1), \dots, (X_T, Y_T)$ is available at the beginning of the prediction game, we treat it in a sequential fashion. We run the algorithm SeqSEW$^*_\tau$ of Section 3.2 from time 1 to time $T$ with $\tau = 1/\sqrt{dT}$ (note that $T$ is known in this setting). Using the standard online-to-batch conversion (see, e.g., Littlestone 1989; Cesa-Bianchi et al. 2004), we define our data-based regressor $\hat f_T : \mathcal{X} \to \mathbb{R}$ as the uniform average

$$\hat f_T \triangleq \frac{1}{T} \sum_{t=1}^T \tilde f_t \tag{22}$$

of the regressors $\tilde f_t : \mathcal{X} \to \mathbb{R}$ sequentially built by the algorithm SeqSEW$^*_\tau$ as

$$\tilde f_t(x) \triangleq \int_{\mathbb{R}^d} [u \cdot \varphi(x)]_{B_t}\, p_t(du). \tag{23}$$
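The online-to-batch conversion in (22) amounts to averaging the sequentially built regressors. A minimal sketch is given below; the helper name `online_to_batch` is hypothetical, and the per-round regressors are assumed to be available as plain callables (computing them, i.e., evaluating the integral in (23), is not addressed here).

```python
def online_to_batch(regressors):
    """Uniform average of the per-round regressors, as in the conversion (22)."""
    regressors = list(regressors)

    def f_hat(x):
        return sum(f(x) for f in regressors) / len(regressors)

    return f_hat

# Toy usage with three stand-in regressors (purely illustrative).
f_hat = online_to_batch([lambda x: 0.0, lambda x: x, lambda x: 2.0 * x])
print(f_hat(3.0))
```

Averaging is what allows Jensen's inequality to transfer the cumulative regret bound of the online algorithm to a bound on the risk of a single regressor.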

Note that, contrary to much prior work from the statistics community such as those of Catoni (2004); Bunea and Nobel (2008); Dalalyan and Tsybakov (2011), the regressors $\tilde f_t : \mathcal{X} \to \mathbb{R}$ are tuned online. Therefore, $\hat f_T$ does not depend on any prior knowledge on the unknown distribution of the $(X_t, Y_t)$, $1 \le t \le T$, such as the unknown variance $\mathbb{E}[(Y - f(X))^2]$ of the noise, the $\|\varphi_j\|_\infty$, or the $\|f - \varphi_j\|_\infty$ (actually, the $\varphi_j$ and the $f - \varphi_j$ do not even need to be bounded in $L^\infty$-norm).

In this respect, this work improves on that of Bunea and Nobel (2008) who tune their online forecasters as a function of $\|f\|_\infty$ and $\sup_{u \in \mathcal{U}} \|u \cdot \varphi\|_\infty$, where $\mathcal{U} \subset \mathbb{R}^d$ is a bounded comparison set$^6$. Their technique is not appropriate when $\|f\|_\infty$ is unknown and it cannot be extended to the case where $\mathcal{U} = \mathbb{R}^d$ (since $\sup_{u \in \mathbb{R}^d} \|u \cdot \varphi\|_\infty = +\infty$ if $\varphi \ne 0$). The major technical difference is that we truncate the base forecasts $u \cdot \varphi(X_t)$ instead of truncating the observations $Y_t$. In particular, this enables us to aggregate the base regressors $u \cdot \varphi$ for all $u \in \mathbb{R}^d$, i.e., over the whole space.

The next sparsity oracle inequality is the main result of this section. It follows from the deterministic regret bound of Corollary 8 and from Jensen's inequality. Two corollaries are to be derived later.



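Theorem 12 Assume that $(X_1, Y_1), \dots, (X_T, Y_T) \in \mathcal{X} \times \mathbb{R}$ are independent random copies of $(X, Y) \in \mathcal{X} \times \mathbb{R}$, where $\mathbb{E}[Y^2] < +\infty$ and $\|\varphi_j\|_{L^2}^2 \triangleq \mathbb{E}[\varphi_j(X)^2] < +\infty$ for all $j = 1, \dots, d$. Then, the data-based regressor $\hat f_T$ defined in (22)-(23) satisfies

$$\mathbb{E}\Bigl[\bigl\|f - \hat f_T\bigr\|_{L^2}^2\Bigr] \le \inf_{u \in \mathbb{R}^d} \left\{ \|f - u \cdot \varphi\|_{L^2}^2 + 64\, \frac{\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]}{T}\, \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\, \|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^d \|\varphi_j\|_{L^2}^2 + 32\, \frac{\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]}{T}.$$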

Note that our risk bounds are stated in expectation (which already improves on exist- 
ing results in the stochastic setting, see the next section). It is however possible to derive 
high-probability bounds by using techniques borrowed, e.g., from Zhang (2005, Theorem 8) 
or Kakade and Tewari (2009, Theorem 2). 

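Proof sketch (of Theorem 12) By Corollary 8 and by definition of $\tilde f_t$ above and of $\hat y_t \triangleq \tilde f_t(X_t)$ in Figure 3, we have, almost surely,

$$\sum_{t=1}^T \bigl(Y_t - \tilde f_t(X_t)\bigr)^2 \le \inf_{u \in \mathbb{R}^d} \left\{ \sum_{t=1}^T \bigl(Y_t - u \cdot \varphi(X_t)\bigr)^2 + 64 \Bigl(\max_{1 \le t \le T} Y_t^2\Bigr) \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\, \|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(X_t) + 32 \max_{1 \le t \le T} Y_t^2.$$

Taking the expectations of both sides and applying Jensen's inequality yields the desired result. For a detailed proof, see Appendix A.2. ■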



Theorem 12 above can be used under several assumptions on the distribution of the output $Y$. In all cases, it suffices to upper bound the amplitude $\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]$. We present below a general corollary and explain later why our fully automatic procedure $\hat f_T$ solves two questions left open by Dalalyan and Tsybakov (2011) (see Corollary 14 below).

6. Bunea and Nobel (2008) study the case where $\mathcal{U}$ is the (scaled) simplex in $\mathbb{R}^d$ or the set of its vertices.

4.1.2 A general corollary

The next sparsity oracle inequality follows from Theorem 12 and from the upper bounds on $\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]$ entailed by Lemmas 24-26 in Appendix B. The proof is postponed to Appendix A.2.

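Corollary 13 Assume that $(X_1, Y_1), \dots, (X_T, Y_T) \in \mathcal{X} \times \mathbb{R}$ are independent random copies of $(X, Y) \in \mathcal{X} \times \mathbb{R}$, that $\sup_{1 \le j \le d} \|\varphi_j\|_{L^2}^2 < +\infty$, that $\mathbb{E}|Y| < +\infty$, and that one of the following assumptions holds on the distribution of $\Delta Y \triangleq Y - \mathbb{E}[Y]$.

• (BD(B)): $|\Delta Y| \le B$ almost surely for a given constant $B > 0$;

• (SG($\sigma^2$)): $\Delta Y$ is subgaussian with variance factor $\sigma^2 > 0$, that is, $\mathbb{E}\bigl[e^{\lambda \Delta Y}\bigr] \le e^{\lambda^2 \sigma^2/2}$ for all $\lambda \in \mathbb{R}$;

• (BEM($\alpha$, M)): $\Delta Y$ has a bounded exponential moment, that is, $\mathbb{E}\bigl[e^{\alpha |\Delta Y|}\bigr] \le M$ for some given constants $\alpha > 0$ and $M > 0$;

• (BM($\alpha$, M)): $\Delta Y$ has a bounded moment, that is, $\mathbb{E}\bigl[|\Delta Y|^\alpha\bigr] \le M$ for some given constants $\alpha > 2$ and $M > 0$.

Then, the data-based regressor $\hat f_T$ defined above satisfies

$$\mathbb{E}\Bigl[\bigl\|f - \hat f_T\bigr\|_{L^2}^2\Bigr] \le \inf_{u \in \mathbb{R}^d} \left\{ \|f - u \cdot \varphi\|_{L^2}^2 + 128 \left(\frac{\mathbb{E}[Y]^2}{T} + \psi_T\right) \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\, \|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^d \|\varphi_j\|_{L^2}^2 + 64 \left(\frac{\mathbb{E}[Y]^2}{T} + \psi_T\right),$$

where

$$\psi_T \triangleq \frac{1}{T}\, \mathbb{E}\Bigl[\max_{1 \le t \le T} \bigl(Y_t - \mathbb{E}[Y_t]\bigr)^2\Bigr] \le \begin{cases} \dfrac{B^2}{T} & \text{under Assumption (BD(B))}, \\[2mm] \dfrac{2 \sigma^2 \ln(2eT)}{T} & \text{under Assumption (SG($\sigma^2$))}, \\[2mm] \dfrac{\ln^2\bigl((M + e)T\bigr)}{\alpha^2 T} & \text{under Assumption (BEM($\alpha$, M))}, \\[2mm] \dfrac{M^{2/\alpha}}{T^{(\alpha - 2)/\alpha}} & \text{under Assumption (BM($\alpha$, M))}. \end{cases}$$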
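To get a feeling for the four rates, the upper bounds on $\psi_T$ can be evaluated numerically. The sketch below assumes the bounds $B^2/T$, $2\sigma^2 \ln(2eT)/T$, $\ln^2((M+e)T)/(\alpha^2 T)$ and $M^{2/\alpha}/T^{(\alpha-2)/\alpha}$ for the four assumptions of Corollary 13; the function name `psi_bound` and its keyword arguments are hypothetical.

```python
import math

def psi_bound(assumption, T, **p):
    """Upper bound on psi_T under one of the assumptions BD, SG, BEM, BM (see lead-in)."""
    if assumption == "BD":      # bounded: |Delta Y| <= B
        return p["B"] ** 2 / T
    if assumption == "SG":      # subgaussian with variance factor sigma2
        return 2.0 * p["sigma2"] * math.log(2.0 * math.e * T) / T
    if assumption == "BEM":     # bounded exponential moment (alpha, M)
        return math.log((p["M"] + math.e) * T) ** 2 / (p["alpha"] ** 2 * T)
    if assumption == "BM":      # bounded moment of order alpha > 2
        return p["M"] ** (2.0 / p["alpha"]) / T ** ((p["alpha"] - 2.0) / p["alpha"])
    raise ValueError(f"unknown assumption: {assumption}")

for T in (10**2, 10**4, 10**6):
    print(T,
          psi_bound("BD", T, B=1.0),
          psi_bound("SG", T, sigma2=1.0),
          psi_bound("BEM", T, alpha=1.0, M=1.0),
          psi_bound("BM", T, alpha=4.0, M=1.0))
```

The first three bounds decay as $1/T$ up to logarithmic factors, whereas the bounded-moment bound decays only polynomially with exponent $(\alpha - 2)/\alpha < 1$.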



Several comments can be made about Corollary 13. We first stress that, if $T \ge 2$, then the two "bias" terms $\mathbb{E}[Y]^2/T$ above can be avoided, at least at the price of a multiplicative factor of $2T/(T - 1) \le 4$. This can be achieved via a slightly more sophisticated online clipping; see Remark 19 in Appendix A.2.

Second, under the assumptions (BD(B)), (SG($\sigma^2$)), or (BEM($\alpha$, M)), the key quantity $\psi_T$ is respectively of the order of $1/T$, $\ln(T)/T$ and $\ln^2(T)/T$. Up to a logarithmic factor, this corresponds to the classical fast rate of convergence $1/T$ obtained in the random design setting for different aggregation problems (see, e.g., Catoni 1999; Juditsky et al. 2008; Audibert 2009 for model-selection-type aggregation and Dalalyan and Tsybakov 2011 for linear aggregation). We were able to get similar rates (with, however, a fully automatic procedure) since our online algorithm SeqSEW$^*_\tau$ is well suited for bounded individual sequences with an unknown bound. More precisely, the finite i.i.d. sequence $Y_1, \dots, Y_T$ is almost surely uniformly bounded by the random bound $\max_{1 \le t \le T} |Y_t|$. Our individual sequence techniques adapt sequentially to this random bound, yielding a regret bound that scales as $\max_{1 \le t \le T} Y_t^2$. As a result, the risk bounds obtained after the online-to-batch conversion scale as $\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]/T$. If the distribution of the output $Y$ is sufficiently lightly-tailed (which includes the quite general bounded-exponential-moment assumption), then we can recover the fast rate of convergence $1/T$ up to a logarithmic factor.

We note that there is still a question left open for heavy-tailed output distributions. For example, under the bounded moment assumption (BM($\alpha$, M)), the rate $T^{-(\alpha - 2)/\alpha}$ that we proved does not match the faster rate obtained by Juditsky et al. (2008); Audibert (2009) under a similar assumption. Their methods use some preliminary knowledge on the output distribution (such as the exponent $\alpha$). Thus, obtaining the same rate with a procedure tuned in an automatic fashion, just like our method $\hat f_T$, is a challenging task. For this purpose, a different tuning of $\eta_t$ or a more sophisticated online truncation might be necessary.

Third, several variations on the assumptions are possible. First note that several classical assumptions on $Y$ expressed in terms of $f(X)$ and $\varepsilon \triangleq Y - f(X)$ are either particular cases of the above corollary or can be treated similarly. Indeed, each of the four assumptions above on $\Delta Y \triangleq Y - \mathbb{E}[Y] = f(X) - \mathbb{E}[f(X)] + \varepsilon$ is satisfied as soon as both the distribution of $f(X) - \mathbb{E}[f(X)]$ and the conditional distribution of $\varepsilon$ (conditionally on $X$) satisfy the same type of assumption. For example, if $f(X) - \mathbb{E}[f(X)]$ is subgaussian with variance factor $\sigma_X^2$ and if $\varepsilon$ is subgaussian conditionally on $X$ with a variance factor uniformly bounded by a constant $\sigma_\varepsilon^2$, then $\Delta Y$ is subgaussian with variance factor $\sigma_X^2 + \sigma_\varepsilon^2$ (see also Remark 20 in Appendix A.2 to avoid conditioning).

The assumptions on $f(X) - \mathbb{E}[f(X)]$ and $\varepsilon$ can also be mixed together. For instance, as explained in Remark 20 in Appendix A.2, under the classical assumptions

$$\|f\|_\infty < +\infty \quad \text{and} \quad \mathbb{E}\bigl[e^{\alpha |\varepsilon|} \,\big|\, X\bigr] \le M \ \text{ a.s.} \tag{24}$$

or

$$\|f\|_\infty < +\infty \quad \text{and} \quad \mathbb{E}\bigl[e^{\lambda \varepsilon} \,\big|\, X\bigr] \le e^{\lambda^2 \sigma^2/2} \ \text{ a.s.}, \quad \forall \lambda \in \mathbb{R}, \tag{25}$$



the key quantity $\psi_T$ in the corollary can be bounded from above by



2 2W({M + e)T) ^ ^ 

°° H — ^— under the set of assumptions (24) , 



T o?T 
^ 4c72 ln(2er) 



oo 



-\ — under the set of assumptions (25) . 






In particular, under the set of assumptions (25), our procedure $\hat f_T$ solves two questions left open by Dalalyan and Tsybakov (2011). We discuss below our contributions in this particular case.

4.1.3 Questions left open by Dalalyan and Tsybakov 

In this subsection we focus on the case when the regression function $f$ is bounded (by an unknown constant) and when the noise $\varepsilon \triangleq Y - f(X)$ is subgaussian conditionally on $X$ in the sense that, for some (unknown) constant $\sigma^2 > 0$,

$$\|f\|_\infty < +\infty \quad \text{and} \quad \mathbb{E}\bigl[e^{\lambda \varepsilon} \,\big|\, X\bigr] \le e^{\lambda^2 \sigma^2/2} \ \text{ a.s.}, \quad \forall \lambda \in \mathbb{R}. \tag{26}$$

A particularly important case is when $\|f\|_\infty < +\infty$ and when the noise $\varepsilon$ is independent of $X$ and normally distributed $\mathcal{N}(0, \sigma^2)$.

Under the set of assumptions (26), the two terms $\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]$ of Theorem 12 can be upper bounded in a simpler and slightly tighter way as compared to the proof of Corollary 13 (we only use the inequality $(x + y)^2 \le 2x^2 + 2y^2$ once, instead of twice). It yields the following sparsity oracle inequality.

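Corollary 14 Assume that $(X_1, Y_1), \dots, (X_T, Y_T) \in \mathcal{X} \times \mathbb{R}$ are independent random copies of $(X, Y) \in \mathcal{X} \times \mathbb{R}$ such that the set of assumptions (26) above holds true. Then, the data-based regressor $\hat f_T$ defined in (22)-(23) satisfies

$$\mathbb{E}\Bigl[\bigl\|f - \hat f_T\bigr\|_{L^2}^2\Bigr] \le \inf_{u \in \mathbb{R}^d} \left\{ \|f - u \cdot \varphi\|_{L^2}^2 + 128\, \frac{\|f\|_\infty^2 + 2 \sigma^2 \ln(2eT)}{T}\, \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\, \|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT} \sum_{j=1}^d \|\varphi_j\|_{L^2}^2 + 64\, \frac{\|f\|_\infty^2 + 2 \sigma^2 \ln(2eT)}{T}.$$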



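Proof We apply Theorem 12 and bound $\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]$ from above. By the elementary inequality $(x + y)^2 \le 2x^2 + 2y^2$ for all $x, y \in \mathbb{R}$, we get

$$\mathbb{E}\Bigl[\max_{1 \le t \le T} Y_t^2\Bigr] = \mathbb{E}\Bigl[\max_{1 \le t \le T} \bigl(f(X_t) + \varepsilon_t\bigr)^2\Bigr] \le 2 \Bigl(\|f\|_\infty^2 + \mathbb{E}\Bigl[\max_{1 \le t \le T} \varepsilon_t^2\Bigr]\Bigr) \le 2 \bigl(\|f\|_\infty^2 + 2 \sigma^2 \ln(2eT)\bigr),$$

where the last inequality follows from Lemma 24 in Appendix B and from the fact that, for all $1 \le t \le T$ and all $\lambda \in \mathbb{R}$, we have $\mathbb{E}\bigl[e^{\lambda \varepsilon_t}\bigr] = \mathbb{E}\bigl[e^{\lambda \varepsilon}\bigr] = \mathbb{E}\bigl[\mathbb{E}[e^{\lambda \varepsilon} \mid X]\bigr] \le e^{\lambda^2 \sigma^2/2}$ by (26). (Note that the assumption of conditional subgaussianity in (26) is stronger than what we need, i.e., subgaussianity without conditioning.) This concludes the proof. ■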
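The bound $\mathbb{E}\bigl[\max_{1 \le t \le T} \varepsilon_t^2\bigr] \le 2\sigma^2 \ln(2eT)$ used in the proof can be compared with a Monte Carlo estimate in the Gaussian case. The sketch below is only a sanity check under the assumption that the $\varepsilon_t$ are i.i.d. $\mathcal{N}(0, \sigma^2)$; the function names and the simulation sizes are arbitrary choices.

```python
import math
import random

def mean_max_sq_gaussian(T, sigma, n_rep, seed=0):
    """Monte Carlo estimate of E[max_{1<=t<=T} eps_t^2] for i.i.d. N(0, sigma^2) noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        total += max(rng.gauss(0.0, sigma) ** 2 for _ in range(T))
    return total / n_rep

def subgaussian_max_bound(T, sigma):
    """The bound 2 * sigma^2 * ln(2 e T) used in the proof above."""
    return 2.0 * sigma ** 2 * math.log(2.0 * math.e * T)

T, sigma = 100, 1.0
print(mean_max_sq_gaussian(T, sigma, n_rep=500, seed=1), subgaussian_max_bound(T, sigma))
```

The estimate stays below the bound with some margin, which is consistent with the bound being valid for every subgaussian distribution and not only for Gaussian noise.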



The above bound is of the same order (up to a $\ln T$ factor) as the sparsity oracle inequality proved in Proposition 1 of Dalalyan and Tsybakov (2011). For the sake of comparison we state below with our notations (e.g., $\beta$ therein corresponds to $1/\eta$ in this paper) a straightforward consequence of this proposition, which follows by Jensen's inequality and the particular$^7$ choice $\tau = 1/\sqrt{dT}$.

Proposition 15 (A consequence of Prop. 1 of Dalalyan and Tsybakov 2011)
Assume that $\sup_{1 \le j \le d} \|\varphi_j\|_\infty < \infty$ and that the set of assumptions (26) above holds true. Then, for all $R > 2\sqrt{d/T}$ and all $\eta \le \bar\eta(R) \triangleq \bigl(2\sigma^2 + 2 \sup_{\|u\|_1 \le R} \|u \cdot \varphi - f\|_\infty^2\bigr)^{-1}$, the mirror averaging aggregate $\hat f_T : \mathcal{X} \to \mathbb{R}$ defined by Dalalyan and Tsybakov (2011, Equations (1) and (3)) satisfies

$$\mathbb{E}\Bigl[\bigl\|f - \hat f_T\bigr\|_{L^2}^2\Bigr] \le \inf_{\|u\|_1 \le R - 2\sqrt{d/T}} \left\{ \|f - u \cdot \varphi\|_{L^2}^2 + \frac{4}{\eta}\, \frac{\|u\|_0}{T + 1} \ln\!\left(1 + \frac{\sqrt{dT}\, \|u\|_1}{\|u\|_0}\right) \right\} + \frac{4}{dT} \sum_{j=1}^d \|\varphi_j\|_{L^2}^2 + \frac{1}{(T + 1)\, \eta}.$$



We can now discuss the two questions left open by Dalalyan and Tsybakov (2011). 

Risk bound on the whole space. Despite the similarity of the two bounds, the sparsity oracle inequality stated in Proposition 15 above only holds for vectors $u$ within an $\ell^1$-ball of finite radius $R - 2\sqrt{d/T}$, while our bound holds over the whole $\mathbb{R}^d$ space. Moreover, the parameter $R$ above has to be chosen in advance, but it cannot be chosen too large since $1/\eta \ge 1/\bar\eta(R)$, which grows as $R^2$ when $R \to +\infty$ (if $\varphi \ne 0$). Dalalyan and Tsybakov (2011, Section 4.2) thus asked whether it was possible to get a bound with $1/\eta < +\infty$ such that the infimum in Proposition 15 extends to the whole $\mathbb{R}^d$ space. Our results show that, thanks to data-driven truncation, the answer is positive.

Note that it is still possible to transform the bound of Proposition 15 into a bound over the whole $\mathbb{R}^d$ space if the parameter $R$ is chosen (illegally) as $R = \|u^*\|_1 + 2\sqrt{d/T}$ (or as a tight upper bound of the last quantity), where $u^* \in \mathbb{R}^d$ minimizes over $\mathbb{R}^d$ the regularized risk

$$\|f - u \cdot \varphi\|_{L^2}^2 + \frac{4}{\bar\eta\bigl(\|u\|_1 + 2\sqrt{d/T}\bigr)}\, \frac{\|u\|_0}{T + 1} \ln\!\left(1 + \frac{\sqrt{dT}\, \|u\|_1}{\|u\|_0}\right) + \frac{4}{dT} \sum_{j=1}^d \|\varphi_j\|_{L^2}^2 + \frac{1}{(T + 1)\, \bar\eta\bigl(\|u\|_1 + 2\sqrt{d/T}\bigr)}.$$

For instance, choosing $R = \|u^*\|_1 + 2\sqrt{d/T}$ and $\eta = \bar\eta(R)$, we get from Proposition 15 that the expected $L^2$-risk $\mathbb{E}\bigl[\|f - \hat f_T\|_{L^2}^2\bigr]$ of the corresponding procedure is upper bounded by the infimum of the above regularized risk over all $u \in \mathbb{R}^d$. However, this parameter tuning is illegal since $\|u^*\|_1$ is not known in practice. On the contrary, thanks to data-driven truncation, the prior knowledge of $\|u^*\|_1$ is not required by our procedure.

7. Proposition 1 of Dalalyan and Tsybakov (2011) may seem more general than Theorem 12 at first sight since it holds for all $\tau > 0$, but this is actually also the case for Theorem 12. The proof of the latter would indeed have remained true had we replaced $\tau = 1/\sqrt{dT}$ with any value of $\tau > 0$. We however chose the reasonable value $\tau = 1/\sqrt{dT}$ to make our algorithm parameter-free. As noted earlier, if $\|\varphi\|_\infty \triangleq \sup_{x \in \mathcal{X}} \max_{1 \le j \le d} |\varphi_j(x)|$ is finite and known by the forecaster, another simple and easy-to-analyse tuning is given by $\tau = 1/(\|\varphi\|_\infty \sqrt{dT})$.

Adaptivity to the unknown variance of the noise. The second open question, which was raised by Dalalyan and Tsybakov (2011, Section 5.1, Remark 6), deals with the prior knowledge of the variance factor $\sigma^2$ of the noise. The latter is indeed required by their algorithm for the choice of the inverse temperature parameter $\eta$. Since the noise level $\sigma^2$ is unknown in practice, the authors asked the important question whether adaptivity to $\sigma^2$ was possible. Up to a $\ln T$ factor, Corollary 14 above provides a positive answer.

4.2 Regression model with fixed design 

In this section, we consider the regression model with fixed design. In this batch setting the forecaster is given at the beginning of the game a $T$-sample $(x_1, Y_1), \dots, (x_T, Y_T) \in \mathcal{X} \times \mathbb{R}$, where the $x_t$ are deterministic elements in $\mathcal{X}$ and where

$$Y_t = f(x_t) + \varepsilon_t, \qquad 1 \le t \le T, \tag{27}$$

for some i.i.d. sequence $\varepsilon_1, \dots, \varepsilon_T \in \mathbb{R}$ (with unknown distribution) and some unknown function $f : \mathcal{X} \to \mathbb{R}$.

In this setting, just like in Section 4.1, our algorithm and the corresponding analysis are a straightforward consequence of the general results on individual sequences developed in Section 3. As in the random design setting, the sample $(x_1, Y_1), \dots, (x_T, Y_T)$ is treated in a sequential fashion. We run the algorithm SeqSEW$^*_\tau$ defined in Figure 3 from time 1 to time $T$ with the particular choice of $\tau = 1/\sqrt{dT}$. We then define our data-based regressor $\hat f_T : \mathcal{X} \to \mathbb{R}$ by

$$\hat f_T(x) \triangleq \begin{cases} \dfrac{1}{n_x} \displaystyle\sum_{\substack{1 \le t \le T \\ t : x_t = x}} \tilde f_t(x) & \text{if } x \in \{x_1, \dots, x_T\}, \\[3mm] 0 & \text{if } x \notin \{x_1, \dots, x_T\}, \end{cases} \tag{28}$$

where $n_x \triangleq \bigl|\{t : x_t = x\}\bigr| = \sum_{t=1}^T \mathbb{1}_{\{x_t = x\}}$, and where the regressors $\tilde f_t : \mathcal{X} \to \mathbb{R}$ sequentially built by the algorithm SeqSEW$^*_\tau$ are defined by

$$\tilde f_t(x) \triangleq \int_{\mathbb{R}^d} [u \cdot \varphi(x)]_{B_t}\, p_t(du). \tag{29}$$

In the particular case when the $x_t$ are all distinct, $\hat f_T$ is simply defined by $\hat f_T(x) \triangleq \tilde f_t(x)$ if $x = x_t$ for some $t \in \{1, \dots, T\}$ and by $\hat f_T(x) \triangleq 0$ otherwise.
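Definition (28) can be sketched as a small helper that averages, at each design point, the regressors of the rounds where that point was observed. The function name `fixed_design_regressor` is hypothetical, and the regressors $\tilde f_t$ are assumed to be given as callables (their computation via (29) is not addressed).

```python
def fixed_design_regressor(xs, regressors):
    """Build f_hat as in (28): average of the f_t over rounds t with x_t = x, and 0 elsewhere.

    xs[t-1] is the design point x_t and regressors[t-1] is the callable f_t.
    """
    pairs = list(zip(xs, regressors))

    def f_hat(x):
        vals = [f(x) for x_t, f in pairs if x_t == x]
        return sum(vals) / len(vals) if vals else 0.0

    return f_hat

# Toy usage: the design point 1 is visited twice, the point 2 once.
f_hat = fixed_design_regressor([1, 2, 1], [lambda x: 1.0, lambda x: 5.0, lambda x: 3.0])
print(f_hat(1), f_hat(2), f_hat(7))
```

When all design points are distinct, each list `vals` has a single element and the helper reduces to $\hat f_T(x_t) = \tilde f_t(x_t)$, in line with the remark above.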

The next theorem is the main result of this subsection. It follows as in the random design setting from the deterministic regret bound of Corollary 8 and from Jensen's inequality. The proof is postponed to Appendix A.3.
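Theorem 16 Consider the regression model with fixed design described in (27). Then, the data-based regressor $\hat f_T$ defined in (28)-(29) satisfies

$$\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^T \bigl(f(x_t) - \hat f_T(x_t)\bigr)^2\right] \le \inf_{u \in \mathbb{R}^d} \left\{ \frac{1}{T} \sum_{t=1}^T \bigl(f(x_t) - u \cdot \varphi(x_t)\bigr)^2 + 64\, \frac{\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]}{T}\, \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\, \|u\|_1}{\|u\|_0}\right) \right\} + \frac{1}{dT^2} \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) + 32\, \frac{\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]}{T}.$$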

As in Section 4.1, the amplitude $\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]$ can be upper bounded under various assumptions. The proof of the following corollary is postponed to Appendix A.3.

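Corollary 17 Consider the regression model with fixed design described in (27). Assume that one of the following assumptions holds on the distribution of $\varepsilon_1$.

• (BD(B)): $|\varepsilon_1| \le B$ almost surely for a given constant $B > 0$;

• (SG($\sigma^2$)): $\varepsilon_1$ is subgaussian with variance factor $\sigma^2 > 0$, that is, $\mathbb{E}\bigl[e^{\lambda \varepsilon_1}\bigr] \le e^{\lambda^2 \sigma^2/2}$ for all $\lambda \in \mathbb{R}$;

• (BEM($\alpha$, M)): $\varepsilon_1$ has a bounded exponential moment, that is, $\mathbb{E}\bigl[e^{\alpha |\varepsilon_1|}\bigr] \le M$ for some given constants $\alpha > 0$ and $M > 0$;

• (BM($\alpha$, M)): $\varepsilon_1$ has a bounded moment, that is, $\mathbb{E}\bigl[|\varepsilon_1|^\alpha\bigr] \le M$ for some given constants $\alpha > 2$ and $M > 0$.

Then, the data-based regressor $\hat f_T$ defined in (28)-(29) satisfies

$$\mathbb{E}\left[\frac{1}{T} \sum_{t=1}^T \bigl(f(x_t) - \hat f_T(x_t)\bigr)^2\right] \le \inf_{u \in \mathbb{R}^d} \left\{ \frac{1}{T} \sum_{t=1}^T \bigl(f(x_t) - u \cdot \varphi(x_t)\bigr)^2 + 128 \left(\frac{\max_{1 \le t \le T} f^2(x_t)}{T} + \psi_T\right) \|u\|_0 \ln\!\left(1 + \frac{\sqrt{dT}\, \|u\|_1}{\|u\|_0}\right) \right\}$$

$$\qquad + \frac{1}{dT^2} \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) + 64 \left(\frac{\max_{1 \le t \le T} f^2(x_t)}{T} + \psi_T\right),$$

where

$$\psi_T \triangleq \frac{1}{T}\, \mathbb{E}\Bigl[\max_{1 \le t \le T} \varepsilon_t^2\Bigr] \le \begin{cases} \dfrac{B^2}{T} & \text{if Assumption (BD(B)) holds}, \\[2mm] \dfrac{2 \sigma^2 \ln(2eT)}{T} & \text{if Assumption (SG($\sigma^2$)) holds}, \\[2mm] \dfrac{\ln^2\bigl((M + e)T\bigr)}{\alpha^2 T} & \text{if Assumption (BEM($\alpha$, M)) holds}, \\[2mm] \dfrac{M^{2/\alpha}}{T^{(\alpha - 2)/\alpha}} & \text{if Assumption (BM($\alpha$, M)) holds}. \end{cases}$$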
The above bound is of the same flavor as that of Dalalyan and Tsybakov (2008, Theorem 5). It has one advantage and one drawback. On the one hand, we note two additional "bias" terms $\bigl(\max_{1 \le t \le T} f^2(x_t)\bigr)/T$ as compared to the bound of Dalalyan and Tsybakov (2008, Theorem 5). As of now, we have not been able to remove them using ideas similar to what we did in the random design case (see Remark 19 in Appendix A.2). On the other hand, under Assumption (SG($\sigma^2$)), contrary to Dalalyan and Tsybakov (2008), our algorithm does not require the prior knowledge of the variance factor $\sigma^2$ of the noise.
Appendix A. Proofs

In this appendix we provide the proofs of some results stated above.

A.1 Proofs of Theorem 10 and Corollary 11

Before proving Theorem 10, we first need the following comment. Since the algorithm SeqSEW$^*_*$ is restarted at the beginning of each regime, the threshold values $B_t$ used on regime $r$ by the algorithm SeqSEW$^*_*$ are not computed on the basis of all past observations $y_1, \dots, y_{t-1}$ but only on the basis of the past observations $y_s$, $s \in \{t_{r-1} + 1, \dots, t - 1\}$. To avoid any ambiguity, we set

$$B_{r,t} \triangleq \Bigl(2^{\lceil \log_2 \max_{t_{r-1} + 1 \le s \le t - 1} y_s^2 \rceil}\Bigr)^{1/2}, \qquad t \in \{t_{r-1} + 1, \dots, t_r\}. \tag{30}$$

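Proof (of Theorem 10) We denote by $R \triangleq \min\{r \in \mathbb{N} : T \le t_r\}$ the index of the last regime. For notational convenience, we re-define $t_R \triangleq T$ (even if $\gamma_T \le 2^R$).

We upper bound the regret of the algorithm SeqSEW$^*_*$ on $\{1, \dots, T\}$ by the sum of its regrets on each time interval. To do so, first note that

$$\sum_{t=1}^T (y_t - \hat y_t)^2 = \sum_{r=0}^R \sum_{t = t_{r-1} + 1}^{t_r} (y_t - \hat y_t)^2 = \sum_{r=0}^R \Biggl( (y_{t_r} - \hat y_{t_r})^2 + \sum_{t = t_{r-1} + 1}^{t_r - 1} (y_t - \hat y_t)^2 \Biggr)$$

$$\le \sum_{r=0}^R \Biggl( 2 \bigl(y_{t_r}^2 + B_{r,t_r}^2\bigr) + \sum_{t = t_{r-1} + 1}^{t_r - 1} (y_t - \hat y_t)^2 \Biggr) \tag{31}$$

$$\le \sum_{r=0}^R \Biggl( \sum_{t = t_{r-1} + 1}^{t_r - 1} (y_t - \hat y_t)^2 \Biggr) + 6 (R + 1)\, {y_T^*}^2, \tag{32}$$

where we set $y_T^* \triangleq \max_{1 \le t \le T} |y_t|$, where (31) follows from the upper bound $(y_{t_r} - \hat y_{t_r})^2 \le 2 \bigl(y_{t_r}^2 + \hat y_{t_r}^2\bigr) \le 2 \bigl(y_{t_r}^2 + B_{r,t_r}^2\bigr)$ (since $|\hat y_{t_r}| \le B_{r,t_r}$ by construction), and where (32) follows from the inequality $y_{t_r}^2 \le {y_T^*}^2$ and the fact that

$$B_{r,t_r}^2 \triangleq 2^{\lceil \log_2 \max_{t_{r-1} + 1 \le t \le t_r - 1} y_t^2 \rceil} \le 2 \max_{t_{r-1} + 1 \le t \le t_r - 1} y_t^2 \le 2\, {y_T^*}^2.$$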

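But, for every $r = 0, \dots, R$, the trace of the empirical Gram matrix on $\{t_{r-1} + 1, \dots, t_r - 1\}$ is upper bounded by

$$\sum_{t = t_{r-1} + 1}^{t_r - 1} \sum_{j=1}^d \varphi_j^2(x_t) \le \sum_{t=1}^{t_r - 1} \sum_{j=1}^d \varphi_j^2(x_t) \le \bigl(e^{2^r} - 1\bigr)^2,$$

where the last inequality follows from the fact that $\gamma_{t_r - 1} \le 2^r$ (by definition of $t_r$). Since in addition $\tau_r \triangleq 1/\sqrt{(e^{2^r} - 1)^2}$, we can apply Corollary 7 on each period $\{t_{r-1} + 1, \dots, t_r - 1\}$, $r = 0, \dots, R$, with $B_\Phi = (e^{2^r} - 1)^2$ and get from (32) the upper bound

$$\sum_{t=1}^T (y_t - \hat y_t)^2 \le \sum_{r=0}^R \inf_{u \in \mathbb{R}^d} \Biggl\{ \sum_{t = t_{r-1} + 1}^{t_r - 1} \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 + \Delta_r(u) \Biggr\} + 6 (R + 1)\, {y_T^*}^2, \tag{33}$$

where

$$\Delta_r(u) \triangleq 32 B_{r,t_r}^2\, \|u\|_0 \ln\!\left(1 + \frac{(e^{2^r} - 1)\, \|u\|_1}{\|u\|_0}\right) + 16 B_{r,t_r}^2 + 1. \tag{34}$$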



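Since the infimum is superadditive and since $(y_{t_r} - u \cdot \varphi(x_{t_r}))^2 \ge 0$ for all $r = 0, \ldots, R$, we
get from (33) that

\begin{align}
\sum_{t=1}^T (y_t - \hat{y}_t)^2 &\le \inf_{u \in \mathbb{R}^d} \sum_{r=0}^R \Biggl( \sum_{t=t_{r-1}+1}^{t_r} \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 + \Delta_r(u) \Biggr) + 6(R+1)\,(y_T^*)^2 \notag \\
&= \inf_{u \in \mathbb{R}^d} \Biggl\{ \sum_{t=1}^T \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 + \sum_{r=0}^R \Delta_r(u) \Biggr\} + 6(R+1)\,(y_T^*)^2. \tag{35}
\end{align}

Let $u \in \mathbb{R}^d$. Next we bound $\sum_{r=0}^R \Delta_r(u)$ and $6(R+1)(y_T^*)^2$ from above. First note
that, by the upper bound $B_{r,t_r}^2 \le 2 (y_T^*)^2$ and by the elementary inequality $\ln(1 + xy) \le
\ln((1 + x)(1 + y)) = \ln(1 + x) + \ln(1 + y)$ with $x = e^{2^r} - 1$ and $y = \|u\|_1/\|u\|_0$, (34) yields

\[ \Delta_r(u) \le 64\,(y_T^*)^2 \|u\|_0\, 2^r + 64\,(y_T^*)^2 \|u\|_0 \ln\Bigl( 1 + \frac{\|u\|_1}{\|u\|_0} \Bigr) + 32\,(y_T^*)^2 + 1. \]

Summing over $r = 0, \ldots, R$, we get

\[ \sum_{r=0}^R \Delta_r(u) \le 64\,\bigl(2^{R+1} - 1\bigr)(y_T^*)^2 \|u\|_0 + (R+1) \Bigl( 64\,(y_T^*)^2 \|u\|_0 \ln\Bigl( 1 + \frac{\|u\|_1}{\|u\|_0} \Bigr) + 32\,(y_T^*)^2 + 1 \Bigr). \tag{36} \]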
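First case: $R = 0$

Substituting (36) in (35), we conclude the proof by noting that $A_T \ge 2 + \log_2 1 \ge 1$ and
that $\ln\Bigl(e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)}\Bigr) \ge 1$.

Second case: $R \ge 1$

Since $R \ge 1$, we have, by definition of $t_{R-1}$,

\[ 2^{R-1} < \gamma_{t_{R-1}} \le \ln\Biggl( e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)} \Biggr). \]

The last inequality entails that $2^{R+1} - 1 \le 4 \cdot 2^{R-1} \le 4 \ln\Bigl(e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)}\Bigr)$ and
that $R + 1 \le 2 + \log_2 \ln\Bigl(e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)}\Bigr) \triangleq A_T$. Therefore, on the one hand,
via (36),

\[ \sum_{r=0}^R \Delta_r(u) \le 256\,(y_T^*)^2 \|u\|_0 \ln\Biggl( e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)} \Biggr) + 64\,(y_T^*)^2 A_T \|u\|_0 \ln\Bigl( 1 + \frac{\|u\|_1}{\|u\|_0} \Bigr) + A_T \bigl( 32\,(y_T^*)^2 + 1 \bigr), \]

and, on the other hand,

\[ 6(R+1)\,(y_T^*)^2 \le 6 A_T (y_T^*)^2. \]

Substituting the last two inequalities in (35) and noting that $(y_T^*)^2 = \max_{1 \le t \le T} y_t^2$ concludes
the proof. ■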

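Proof (of Corollary 11) The proof is straightforward. In view of Theorem 10, we just
need to check that the quantity (continuously extended in $s = 0$)

\[ 256 \Bigl( \max_{1 \le t \le T} y_t^2 \Bigr) s \ln\Biggl( e + \sqrt{\sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t)} \Biggr) + 64 \Bigl( \max_{1 \le t \le T} y_t^2 \Bigr) A_T\, s \ln\Bigl( 1 + \frac{U}{s} \Bigr) \]

is non-decreasing in $s \in \mathbb{R}_+$ and in $U \in \mathbb{R}_+$.

This is clear for $U$. The fact that it is also non-decreasing in $s$ comes from the following
remark. For all $U \ge 0$, the function $s \in (0, +\infty) \mapsto s \ln(1 + U/s)$ has a derivative equal to

\[ \ln\Bigl( 1 + \frac{U}{s} \Bigr) - \frac{U/s}{1 + U/s} \qquad \text{for all } s > 0. \]

From the elementary inequality

\[ \ln(1 + u) = -\ln\Bigl( \frac{1}{1 + u} \Bigr) \ge -\Bigl( \frac{1}{1 + u} - 1 \Bigr) = \frac{u}{1 + u}, \]

which holds for all $u \in (-1, +\infty)$, the above derivative is nonnegative for all $s > 0$ so that
the continuous extension $s \in \mathbb{R}_+ \mapsto s \ln(1 + U/s)$ is non-decreasing. ■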
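The monotonicity argument above can be checked numerically. The sketch below evaluates the map from s to s ln(1 + U/s) on a grid, together with the closed-form derivative ln(1 + U/s) - (U/s)/(1 + U/s); the value of U and the grid are arbitrary choices made for illustration.

```python
import math

def g(s, U):
    """s * ln(1 + U/s), continuously extended by 0 at s = 0."""
    return 0.0 if s == 0.0 else s * math.log1p(U / s)

def g_prime(s, U):
    """Closed-form derivative in s, valid for s > 0."""
    r = U / s
    return math.log1p(r) - r / (1.0 + r)

U = 3.0                                        # arbitrary illustrative value
grid = [0.01 * k for k in range(0, 501)]       # s ranges over [0, 5]
values = [g(s, U) for s in grid]

is_nondecreasing = all(a <= b for a, b in zip(values, values[1:]))
derivative_nonneg = all(g_prime(s, U) >= 0.0 for s in grid[1:])
```

A grid check of this kind is only a sanity test; the proof itself rests on the inequality ln(1 + u) >= u/(1 + u).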
A.2 Proofs of Theorem 12 and Corollary 13

In this subsection, we set $\varepsilon \triangleq Y - f(X)$, so that the pairs $(X_1, \varepsilon_1), \ldots, (X_T, \varepsilon_T)$ are independent
copies of $(X, \varepsilon) \in \mathcal{X} \times \mathbb{R}$. We also define $\sigma \ge 0$ by

\[ \sigma^2 \triangleq \mathbb{E}[\varepsilon^2] = \mathbb{E}\bigl[(Y - f(X))^2\bigr]. \]



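Proof (of Theorem 12) By Corollary 8 and the definitions of $\tilde{f}_t$ above and $\hat{y}_t \triangleq \tilde{f}_t(X_t)$
in Figure 3, we have, almost surely,

\[ \sum_{t=1}^T \bigl(Y_t - \tilde{f}_t(X_t)\bigr)^2 \le \inf_{u \in \mathbb{R}^d} \Biggl\{ \sum_{t=1}^T \bigl(Y_t - u \cdot \varphi(X_t)\bigr)^2 + 64 \Bigl( \max_{1 \le t \le T} Y_t^2 \Bigr) \|u\|_0 \ln\Bigl( 1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0} \Bigr) \Biggr\} + \frac{1}{dT} \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(X_t) + 32 \max_{1 \le t \le T} Y_t^2. \]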

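It remains to take the expectations of both sides with respect to $((X_1, Y_1), \ldots, (X_T, Y_T))$.
First note that for all $t = 1, \ldots, T$, since $\varepsilon_t \triangleq Y_t - f(X_t)$, we have

\[ \mathbb{E}\Bigl[ \bigl(Y_t - \tilde{f}_t(X_t)\bigr)^2 \Bigr] = \mathbb{E}\Bigl[ \bigl(\varepsilon_t + f(X_t) - \tilde{f}_t(X_t)\bigr)^2 \Bigr] = \sigma^2 + \mathbb{E}\Bigl[ \bigl(f(X_t) - \tilde{f}_t(X_t)\bigr)^2 \Bigr], \]

since $\mathbb{E}[\varepsilon_t^2] = \mathbb{E}[\varepsilon^2] = \sigma^2$ on the one hand, and, on the other hand, $\tilde{f}_t$ is built on
$(X_s, Y_s)_{1 \le s \le t-1}$ and $\mathbb{E}[\varepsilon_t \mid (X_s, Y_s)_{1 \le s \le t-1}, X_t] = \mathbb{E}[\varepsilon_t \mid X_t] = 0$ (from the independence of
$(X_s, Y_s)_{1 \le s \le t-1}$ and $(X_t, Y_t)$ and by definition of $f$).
In the same way,

\[ \mathbb{E}\Bigl[ \bigl(Y_t - u \cdot \varphi(X_t)\bigr)^2 \Bigr] = \sigma^2 + \mathbb{E}\Bigl[ \bigl(f(X_t) - u \cdot \varphi(X_t)\bigr)^2 \Bigr]. \]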



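Therefore, by Jensen's inequality and the concavity of the infimum, the last inequality
becomes, after taking the expectations of both sides,

\[ T\sigma^2 + \sum_{t=1}^T \mathbb{E}\Bigl[ \bigl(f(X_t) - \tilde{f}_t(X_t)\bigr)^2 \Bigr] \le \inf_{u \in \mathbb{R}^d} \Biggl\{ T\sigma^2 + \sum_{t=1}^T \mathbb{E}\Bigl[ \bigl(f(X_t) - u \cdot \varphi(X_t)\bigr)^2 \Bigr] + 64\, \mathbb{E}\Bigl[ \max_{1 \le t \le T} Y_t^2 \Bigr] \|u\|_0 \ln\Bigl( 1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0} \Bigr) \Biggr\} + \frac{1}{dT} \sum_{j=1}^d \sum_{t=1}^T \mathbb{E}\bigl[\varphi_j^2(X_t)\bigr] + 32\, \mathbb{E}\Bigl[ \max_{1 \le t \le T} Y_t^2 \Bigr]. \]

Noting that the $T\sigma^2$ cancel out, dividing the two sides by $T$, and using the fact that $X_t \sim X$
in the right-hand side, we get

\[ \frac{1}{T} \sum_{t=1}^T \mathbb{E}\Bigl[ \bigl(f(X_t) - \tilde{f}_t(X_t)\bigr)^2 \Bigr] \le \inf_{u \in \mathbb{R}^d} \Biggl\{ \|f - u \cdot \varphi\|_{L^2}^2 + 64\, \frac{\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]}{T}\, \|u\|_0 \ln\Bigl( 1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0} \Bigr) \Biggr\} + \frac{1}{dT} \sum_{j=1}^d \|\varphi_j\|_{L^2}^2 + 32\, \frac{\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]}{T}. \]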



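The right-hand side of the last inequality is exactly the upper bound stated in Theorem 12.
To conclude the proof, we thus only need to check that $\mathbb{E}\bigl[\|f - \hat{f}_T\|_{L^2}^2\bigr]$ is bounded from above
by the left-hand side. But by definition of $\hat{f}_T$ and by convexity of the square loss we have

\[ \mathbb{E}\Bigl[ \|f - \hat{f}_T\|_{L^2}^2 \Bigr] \le \frac{1}{T} \sum_{t=1}^T \mathbb{E}\Bigl[ \|f - \tilde{f}_t\|_{L^2}^2 \Bigr] = \frac{1}{T} \sum_{t=1}^T \mathbb{E}\Bigl[ \bigl(f(X_t) - \tilde{f}_t(X_t)\bigr)^2 \Bigr]. \]

The last equality follows classically from the fact that, for all $t = 1, \ldots, T$, $(X_s, Y_s)_{1 \le s \le t-1}$
(on which $\tilde{f}_t$ is constructed) is independent from both $X_t$ and $X$ and the fact that $X_t \sim X$. ■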



Remark 18 The fact that the inequality stated in Corollary 8 has a leading constant equal
to 1 on individual sequences is crucial to derive in the stochastic setting an oracle inequality
in terms of the (excess) risks $\mathbb{E}\bigl[\|f - \hat{f}_T\|_{L^2}^2\bigr]$ and $\|f - u \cdot \varphi\|_{L^2}^2$. Indeed, if the constant
appearing in front of the infimum was equal to $C > 1$, then the $T\sigma^2$ would not cancel out in
the previous proof, so that the resulting expected inequality would contain a non-vanishing
additive term $(C - 1)\sigma^2$.



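Proof (of Corollary 13) We can apply Theorem 12. Then, to prove the upper bound
on $\mathbb{E}\bigl[\|f - \hat{f}_T\|_{L^2}^2\bigr]$, it suffices to show that

\[ \frac{\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]}{T} \le 2 \Bigl( \frac{\mathbb{E}[Y]^2}{T} + \psi_T \Bigr). \tag{37} \]

Recall that

\[ \psi_T \triangleq \frac{1}{T}\, \mathbb{E}\Bigl[ \max_{1 \le t \le T} \bigl(Y_t - \mathbb{E}[Y_t]\bigr)^2 \Bigr] = \frac{1}{T}\, \mathbb{E}\Bigl[ \max_{1 \le t \le T} (\Delta Y)_t^2 \Bigr], \]

where we defined $(\Delta Y)_t \triangleq Y_t - \mathbb{E}[Y_t] = Y_t - \mathbb{E}[Y]$ for all $t = 1, \ldots, T$.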
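From the elementary inequality $(x + y)^2 \le 2x^2 + 2y^2$ for all $x, y \in \mathbb{R}$, we have

\[ \mathbb{E}\Bigl[ \max_{1 \le t \le T} Y_t^2 \Bigr] = \mathbb{E}\Bigl[ \max_{1 \le t \le T} \bigl(\mathbb{E}[Y] + (\Delta Y)_t\bigr)^2 \Bigr] \le 2\, \mathbb{E}[Y]^2 + 2\, \mathbb{E}\Bigl[ \max_{1 \le t \le T} (\Delta Y)_t^2 \Bigr]. \tag{38} \]

Dividing both sides by $T$, we get (37).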



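As for the upper bound on $\psi_T$, since the $(\Delta Y)_t$, $1 \le t \le T$, are distributed as $\Delta Y$,
we can apply Lemmas 24, 25, and 26 in Appendix B.3 to bound $\psi_T$ from above under
the assumptions $(\mathrm{SG}(\sigma^2))$, $(\mathrm{BEM}(\alpha, M))$, and $(\mathrm{BM}(\alpha, M))$ respectively (the upper bound
under $(\mathrm{BD}(B))$ is straightforward):

\[ \mathbb{E}\Bigl[ \max_{1 \le t \le T} (\Delta Y)_t^2 \Bigr] \le \begin{cases} B^2 & \text{if Assumption } (\mathrm{BD}(B)) \text{ holds,} \\ 2\sigma^2 \ln(2eT) & \text{if Assumption } (\mathrm{SG}(\sigma^2)) \text{ holds,} \\ \dfrac{\ln^2((M + e)T)}{\alpha^2} & \text{if Assumption } (\mathrm{BEM}(\alpha, M)) \text{ holds,} \\ (MT)^{2/\alpha} & \text{if Assumption } (\mathrm{BM}(\alpha, M)) \text{ holds.} \end{cases} \]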



Remark 19 If $T \ge 2$, then the two "bias" terms $\mathbb{E}[Y]^2/T$ appearing in Corollary 13 can
be avoided, at least at the price of a multiplicative factor of $2T/(T - 1) \le 4$. It suffices to
use a slightly more sophisticated online clipping defined as follows. The first round $t = 1$ is
only used to observe $Y_1$. Then, the algorithm SeqSEW* is run with $\tau = 1/\sqrt{d(T - 1)}$ from
round 2 up to round $T$ with the following important modification: instead of truncating the
predictions to $[-B_t, B_t]$, which is best suited to the case $\mathbb{E}[Y] = 0$, we truncate them to the
interval

\[ [Y_1 - B'_t,\; Y_1 + B'_t], \qquad \text{where } B'_t \triangleq \Bigl( 2^{\lceil \log_2 \max_{1 \le s \le t-1} |Y_s - Y_1|^2 \rceil} \Bigr)^{1/2}. \]

If $\eta_t$ is changed accordingly, i.e., if $\eta_t = 1/(8 B_t'^2)$, then it is easy to see that the resulting procedure
$\hat{f}_T \triangleq \frac{1}{T-1} \sum_{t=2}^T \tilde{f}_t$ (where $\tilde{f}_2, \ldots, \tilde{f}_T$ are the regressors output by SeqSEW*) satisfies









2 ' 




E 




f-fT 










L2 


ue^d- 1 



where $\mathrm{Var}[Y] \triangleq \mathbb{E}[(Y - \mathbb{E}[Y])^2]$. Comparing the last bound to that of Corollary 13, we note
that the two terms $\mathbb{E}[Y]^2/T$ are absent, and that we lose a multiplicative factor at most
of 4 since $\mathrm{Var}[Y] \le \mathbb{E}\bigl[\max_{2 \le t \le T} (Y_t - \mathbb{E}[Y_t])^2\bigr] = (T - 1)\psi_{T-1}$ so that



32 






Remark 20 We mentioned after Corollary 13 that each of the four assumptions on $\Delta Y$ is
fulfilled as soon as both the distribution of $f(X) - \mathbb{E}[f(X)]$ and the conditional distribution of
$\varepsilon$ (conditionally on $X$) satisfy the same type of assumption. It actually extends to the more
general case when the conditional distribution of $\varepsilon$ given $X$ is replaced with the distribution
of $\varepsilon$ itself (without conditioning). This relies on the elementary upper bound

\[ \mathbb{E}\Bigl[ \max_{1 \le t \le T} (\Delta Y)_t^2 \Bigr] = \mathbb{E}\Bigl[ \max_{1 \le t \le T} \bigl(f(X_t) - \mathbb{E}[f(X)] + \varepsilon_t\bigr)^2 \Bigr] \le 2\, \mathbb{E}\Bigl[ \max_{1 \le t \le T} \bigl(f(X_t) - \mathbb{E}[f(X)]\bigr)^2 \Bigr] + 2\, \mathbb{E}\Bigl[ \max_{1 \le t \le T} \varepsilon_t^2 \Bigr]. \]

From the last inequality, we can also see that assumptions of different nature can be made
on $f(X) - \mathbb{E}[f(X)]$ and $\varepsilon$, such as the assumptions given in (24) or in (25).

A.3 Proofs of Theorem 16 and Corollary 17

Proof (of Theorem 16) The proof follows the same lines as in the proof of Theorem 12.
We thus only sketch the main arguments. In the sequel, we set $\sigma^2 \triangleq \mathbb{E}[\varepsilon_1^2]$.



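Applying Corollary 8 we have, almost surely,

\[ \sum_{t=1}^T \bigl(Y_t - \tilde{f}_t(x_t)\bigr)^2 \le \inf_{u \in \mathbb{R}^d} \Biggl\{ \sum_{t=1}^T \bigl(Y_t - u \cdot \varphi(x_t)\bigr)^2 + 64 \Bigl( \max_{1 \le t \le T} Y_t^2 \Bigr) \|u\|_0 \ln\Bigl( 1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0} \Bigr) \Biggr\} + \frac{1}{dT} \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) + 32 \max_{1 \le t \le T} Y_t^2. \]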



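Taking the expectations of both sides, expanding the squares $(Y_t - \tilde{f}_t(x_t))^2$ and $(Y_t - u \cdot
\varphi(x_t))^2$, noting that two terms $T\sigma^2$ cancel out, and then dividing both sides by $T$, we get

\[ \mathbb{E}\Biggl[ \frac{1}{T} \sum_{t=1}^T \bigl(f(x_t) - \tilde{f}_t(x_t)\bigr)^2 \Biggr] \le \inf_{u \in \mathbb{R}^d} \Biggl\{ \frac{1}{T} \sum_{t=1}^T \bigl(f(x_t) - u \cdot \varphi(x_t)\bigr)^2 + 64\, \frac{\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]}{T}\, \|u\|_0 \ln\Bigl( 1 + \frac{\sqrt{dT}\,\|u\|_1}{\|u\|_0} \Bigr) \Biggr\} + \frac{1}{dT^2} \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t) + 32\, \frac{\mathbb{E}\bigl[\max_{1 \le t \le T} Y_t^2\bigr]}{T}. \]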



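The right-hand side is exactly the upper bound stated in Theorem 16. We thus only need
to check that

\[ \mathbb{E}\Biggl[ \frac{1}{T} \sum_{t=1}^T \bigl(f(x_t) - \hat{f}_T(x_t)\bigr)^2 \Biggr] \le \mathbb{E}\Biggl[ \frac{1}{T} \sum_{t=1}^T \bigl(f(x_t) - \tilde{f}_t(x_t)\bigr)^2 \Biggr]. \tag{40} \]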
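This is an equality if the $x_t$ are all distinct. In general we get an inequality which follows
from the convexity of the square loss. Indeed, by definition of $n_x$, we have, almost surely,

\begin{align*}
\sum_{t=1}^T \bigl(f(x_t) - \hat{f}_T(x_t)\bigr)^2 &= \sum_{x \in \{x_1, \ldots, x_T\}} \sum_{\substack{1 \le t \le T \\ t : x_t = x}} \bigl(f(x) - \hat{f}_T(x)\bigr)^2 = \sum_{x \in \{x_1, \ldots, x_T\}} n_x \bigl(f(x) - \hat{f}_T(x)\bigr)^2 \\
&= \sum_{x \in \{x_1, \ldots, x_T\}} n_x \Biggl( f(x) - \frac{1}{n_x} \sum_{\substack{1 \le t \le T \\ t : x_t = x}} \tilde{f}_t(x) \Biggr)^2 \\
&\le \sum_{x \in \{x_1, \ldots, x_T\}} n_x\, \frac{1}{n_x} \sum_{\substack{1 \le t \le T \\ t : x_t = x}} \bigl(f(x) - \tilde{f}_t(x)\bigr)^2 = \sum_{t=1}^T \bigl(f(x_t) - \tilde{f}_t(x_t)\bigr)^2,
\end{align*}

where the second line is by definition of $\hat{f}_T$ and where the last line follows from Jensen's
inequality. Dividing both sides by $T$ and taking their expectations, we get (40), which
concludes the proof. ■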
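The Jensen step behind (40) can be illustrated on a toy fixed design with repeated points. In the sketch below, the design points, the target function, and the sequential predictions are all hypothetical numbers; the averaged regressor at a point x is taken to be the mean of the sequential predictions made at the rounds t with x_t = x.

```python
from collections import defaultdict

def averaged_vs_sequential(xs, f, preds):
    """Compare the averaged regressor with the sequential ones on a fixed design.

    preds[t] plays the role of the sequential prediction at round t (made-up values).
    Returns (lhs, rhs): the summed squared error of the per-point average,
    and the summed squared error of the sequential predictions.
    """
    groups = defaultdict(list)
    for x, p in zip(xs, preds):
        groups[x].append(p)
    avg = {x: sum(ps) / len(ps) for x, ps in groups.items()}
    lhs = sum((f(x) - avg[x]) ** 2 for x in xs)
    rhs = sum((f(x) - p) ** 2 for x, p in zip(xs, preds))
    return lhs, rhs

f = lambda x: 2.0 * x + 1.0                      # hypothetical target function
xs = [0, 1, 0, 2, 1, 0]                          # design with repeated points
preds = [0.5, 3.4, 1.8, 4.0, 2.1, 1.0]           # hypothetical predictions
lhs, rhs = averaged_vs_sequential(xs, f, preds)

# With pairwise distinct design points the two sides coincide.
lhs_distinct, rhs_distinct = averaged_vs_sequential([0, 1, 2], f, [0.5, 3.4, 4.0])
```

Averaging over repeated points can only reduce the summed squared error, which is the content of the inequality above.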



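Proof (of Corollary 17) First note that

\[ \mathbb{E}\Bigl[ \max_{1 \le t \le T} Y_t^2 \Bigr] = \mathbb{E}\Bigl[ \max_{1 \le t \le T} \bigl(f(x_t) + \varepsilon_t\bigr)^2 \Bigr] \le 2 \Bigl( \max_{1 \le t \le T} f^2(x_t) + \mathbb{E}\Bigl[ \max_{1 \le t \le T} \varepsilon_t^2 \Bigr] \Bigr). \]

The proof then follows the exact same lines as for Corollary 13 with the sequence $(\varepsilon_t)$ instead
of the sequence $((\Delta Y)_t)$. ■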



Appendix B. Tools 

Next we provide several (in)equalities that prove to be useful throughout the paper. 
B.1 A duality formula for the Kullback-Leibler divergence

We recall below a key duality formula satisfied by the Kullback-Leibler divergence and
whose proof can be found, e.g., in the monograph by Catoni (2004, pp. 159-160). We use
the notations of Section 2.

Proposition 21 For any measurable space $(\Theta, \mathcal{B})$, any probability distribution $\pi$ on $(\Theta, \mathcal{B})$,
and any measurable function $h : \Theta \to [a, +\infty)$ bounded from below (by some $a \in \mathbb{R}$), we
have

\[ -\ln \int_\Theta e^{-h} \, d\pi = \inf_{\rho \in \mathcal{M}_1(\Theta)} \Bigl\{ \int_\Theta h \, d\rho + \mathcal{K}(\rho, \pi) \Bigr\}, \]

where $\mathcal{M}_1(\Theta)$ denotes the set of all probability distributions on $(\Theta, \mathcal{B})$, and where the expectations
$\int_\Theta h \, d\rho \in [a, +\infty]$ are always well defined since $h$ is bounded from below.
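On a finite space the duality formula can be verified directly. The sketch below uses a three-point space with an arbitrary prior and an arbitrary function h (both hypothetical), and relies on the standard fact, not stated in the proposition itself, that the infimum is attained at the Gibbs distribution proportional to pi(i) exp(-h(i)).

```python
import math

def kl(rho, pi):
    """Kullback-Leibler divergence between two distributions on a finite set."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0.0)

def objective(rho, pi, h):
    """Expectation of h under rho plus KL(rho, pi)."""
    return sum(r * hv for r, hv in zip(rho, h)) + kl(rho, pi)

pi = [0.5, 0.3, 0.2]          # hypothetical prior on three points
h = [1.0, 0.2, 2.5]           # hypothetical bounded function

# Left-hand side of the duality formula.
lhs = -math.log(sum(p * math.exp(-hv) for p, hv in zip(pi, h)))

# Gibbs distribution: rho(i) proportional to pi(i) * exp(-h(i)).
w = [p * math.exp(-hv) for p, hv in zip(pi, h)]
gibbs = [wi / sum(w) for wi in w]

at_gibbs = objective(gibbs, pi, h)     # agrees with lhs up to rounding
at_prior = objective(pi, pi, h)        # any other rho cannot do better
```

Evaluating the objective at other distributions, such as the prior itself, gives values at least as large as the left-hand side.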
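B.2 Some tools to exploit our PAC-Bayesian inequalities

In this subsection we recall two results needed for the derivation of Proposition 1 and Proposition 5
from the PAC-Bayesian inequalities (7) and (16). The proofs are due to Dalalyan
and Tsybakov (2007, 2008) and we only reproduce$^8$ them for the convenience of the reader.

For any $u^* \in \mathbb{R}^d$ and $\tau > 0$, define $\rho_{u^*,\tau}$ as the translated of $\pi_\tau$ at $u^*$, namely,

\[ \rho_{u^*,\tau}(du) \triangleq \frac{d\pi_\tau}{du}(u - u^*) \, du = \prod_{j=1}^d \frac{(3/\tau) \, du_j}{2 \bigl(1 + |u_j - u_j^*|/\tau\bigr)^4}. \]

Lemma 22 For all $u^* \in \mathbb{R}^d$ and $\tau > 0$, the probability distribution $\rho_{u^*,\tau}$ satisfies

\[ \int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 \rho_{u^*,\tau}(du) = \sum_{t=1}^T \bigl(y_t - u^* \cdot \varphi(x_t)\bigr)^2 + \tau^2 \sum_{j=1}^d \sum_{t=1}^T \varphi_j^2(x_t). \]

Lemma 23 For all $u^* \in \mathbb{R}^d$ and $\tau > 0$, the probability distribution $\rho_{u^*,\tau}$ satisfies

\[ \mathcal{K}(\rho_{u^*,\tau}, \pi_\tau) \le 4 \|u^*\|_0 \ln\Bigl( 1 + \frac{\|u^*\|_1}{\|u^*\|_0 \, \tau} \Bigr). \]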



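Proof (of Lemma 22) For all $t \in \{1, \ldots, T\}$ we expand the square $(y_t - u \cdot \varphi(x_t))^2 =
(y_t - u^* \cdot \varphi(x_t) + (u^* - u) \cdot \varphi(x_t))^2$ and use the linearity of the integral to get

\begin{align}
\int_{\mathbb{R}^d} \sum_{t=1}^T \bigl(y_t - u \cdot \varphi(x_t)\bigr)^2 \rho_{u^*,\tau}(du) &= \sum_{t=1}^T \bigl(y_t - u^* \cdot \varphi(x_t)\bigr)^2 + \sum_{t=1}^T \int_{\mathbb{R}^d} \bigl((u^* - u) \cdot \varphi(x_t)\bigr)^2 \rho_{u^*,\tau}(du) \tag{42} \\
&\quad + 2 \sum_{t=1}^T \bigl(y_t - u^* \cdot \varphi(x_t)\bigr) \underbrace{\int_{\mathbb{R}^d} (u^* - u) \cdot \varphi(x_t) \, \rho_{u^*,\tau}(du)}_{=0}. \notag
\end{align}

The last sum equals zero by symmetry of $\rho_{u^*,\tau}$ around $u^*$, which yields $\int_{\mathbb{R}^d} u \, \rho_{u^*,\tau}(du) = u^*$.

As for the second sum of the right-hand side, it can be bounded from above similarly.
Indeed, expanding the inner product and then the square $((u^* - u) \cdot \varphi(x_t))^2$ we have, for
all $t = 1, \ldots, T$,

\[ \bigl((u^* - u) \cdot \varphi(x_t)\bigr)^2 = \sum_{j=1}^d (u_j^* - u_j)^2 \varphi_j^2(x_t) + \sum_{1 \le j \ne k \le d} (u_j^* - u_j)(u_k^* - u_k) \, \varphi_j(x_t) \, \varphi_k(x_t). \]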



8. The notations are however slightly modified because of the change in the statistical setting and goal.
The target predictions $(f(x_1), \ldots, f(x_T))$ are indeed replaced with the observations $(y_1, \ldots, y_T)$ and
the prediction loss $\|f - f_u\|_n^2$ is replaced with the cumulative loss $\sum_{t=1}^T (y_t - u \cdot \varphi(x_t))^2$. Moreover,
the analysis of the present proof is slightly simpler since we just need to consider the case $L_0 = +\infty$
according to the notations of Theorem 5 by Dalalyan and Tsybakov (2008).
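By symmetry of $\rho_{u^*,\tau}$ around $u^*$ and the fact that $\rho_{u^*,\tau}$ is a product-distribution, we get

\begin{align}
\sum_{t=1}^T \int_{\mathbb{R}^d} \bigl((u^* - u) \cdot \varphi(x_t)\bigr)^2 \rho_{u^*,\tau}(du) &= \sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t) \int_{\mathbb{R}^d} (u_j^* - u_j)^2 \rho_{u^*,\tau}(du) + 0 \notag \\
&= \sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t) \int_{\mathbb{R}} (u_j - u_j^*)^2 \frac{(3/\tau) \, du_j}{2 \bigl(1 + |u_j - u_j^*|/\tau\bigr)^4} \tag{43} \\
&= \tau^2 \sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t) \int_{\mathbb{R}} \frac{3 t^2 \, dt}{2 (1 + |t|)^4} \tag{44} \\
&= \tau^2 \sum_{t=1}^T \sum_{j=1}^d \varphi_j^2(x_t), \tag{45}
\end{align}

where (43) follows by definition of $\rho_{u^*,\tau}$, where (44) is obtained by the change of variables
$t = (u_j - u_j^*)/\tau$, and where (45) follows from the equality $\int_{\mathbb{R}} \frac{3 t^2 \, dt}{2 (1 + |t|)^4} = 1$ that can be
proved by integrating by parts. Substituting (45) into (42) concludes the proof. ■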
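The one-dimensional identity used in the last step can be checked by numerical quadrature. The sketch below integrates the density 3/(2(1 + |t|)^4) and its second moment over the real line with a midpoint rule after the substitution t = s/(1 - s); the quadrature scheme and the number of nodes are choices made for this illustration, not part of the original argument.

```python
def density(t):
    """One-dimensional prior density with scale 1: 3 / (2 * (1 + |t|)**4)."""
    return 3.0 / (2.0 * (1.0 + abs(t)) ** 4)

def integrate_over_real_line(func, n=20_000):
    """Midpoint rule on (0, 1) after the substitution t = s / (1 - s).

    Assumes func is even, so the integral over R is twice the one over R_+.
    """
    total = 0.0
    for k in range(n):
        s = (k + 0.5) / n
        t = s / (1.0 - s)
        jacobian = 1.0 / (1.0 - s) ** 2      # dt/ds
        total += func(t) * jacobian
    return 2.0 * total / n

mass = integrate_over_real_line(density)
second_moment = integrate_over_real_line(lambda t: t * t * density(t))
```

Both quantities come out numerically equal to 1: the density is normalized and has unit second moment, which is why the translated prior contributes exactly the scale squared per coordinate.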
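Proof (of Lemma 23) By definition of $\rho_{u^*,\tau}$ and $\pi_\tau$, we have

\begin{align}
\mathcal{K}(\rho_{u^*,\tau}, \pi_\tau) &\triangleq \int_{\mathbb{R}^d} \Bigl( \ln \frac{d\rho_{u^*,\tau}}{d\pi_\tau}(u) \Bigr) \rho_{u^*,\tau}(du) = \int_{\mathbb{R}^d} \Biggl( \ln \prod_{j=1}^d \frac{(1 + |u_j|/\tau)^4}{(1 + |u_j - u_j^*|/\tau)^4} \Biggr) \rho_{u^*,\tau}(du) \notag \\
&= 4 \int_{\mathbb{R}^d} \Biggl( \sum_{j=1}^d \ln \frac{1 + |u_j|/\tau}{1 + |u_j - u_j^*|/\tau} \Biggr) \rho_{u^*,\tau}(du). \tag{46}
\end{align}

But, for all $u \in \mathbb{R}^d$, by the triangle inequality,

\[ 1 + |u_j|/\tau \le 1 + |u_j^*|/\tau + |u_j - u_j^*|/\tau \le \bigl(1 + |u_j^*|/\tau\bigr) \bigl(1 + |u_j - u_j^*|/\tau\bigr), \]

so that Equation (46) yields the upper bound

\[ \mathcal{K}(\rho_{u^*,\tau}, \pi_\tau) \le 4 \sum_{j=1}^d \ln\bigl(1 + |u_j^*|/\tau\bigr) = 4 \sum_{j : u_j^* \ne 0} \ln\bigl(1 + |u_j^*|/\tau\bigr). \]

We now recall that $\|u^*\|_0 \triangleq |\{j : u_j^* \ne 0\}|$ and apply Jensen's inequality to the concave
function $x \in (-1, +\infty) \mapsto \ln(1 + x)$ to get

\[ \sum_{j : u_j^* \ne 0} \ln\bigl(1 + |u_j^*|/\tau\bigr) = \|u^*\|_0 \, \frac{1}{\|u^*\|_0} \sum_{j : u_j^* \ne 0} \ln\bigl(1 + |u_j^*|/\tau\bigr) \le \|u^*\|_0 \ln\Biggl( 1 + \frac{\sum_{j : u_j^* \ne 0} |u_j^*|}{\|u^*\|_0 \, \tau} \Biggr) = \|u^*\|_0 \ln\Bigl( 1 + \frac{\|u^*\|_1}{\|u^*\|_0 \, \tau} \Bigr). \]

This concludes the proof.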
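The final Jensen step can be checked on a concrete sparse vector. In the sketch below the vector and the scale are hypothetical, and the two functions compute the sum of logarithms over the nonzero coordinates and the sparsity-dependent upper bound respectively (without the common factor 4).

```python
import math

def sum_of_logs(u_star, tau):
    """Sum over nonzero coordinates of ln(1 + |u_j*| / tau)."""
    return sum(math.log1p(abs(u) / tau) for u in u_star if u != 0.0)

def sparsity_bound(u_star, tau):
    """||u*||_0 * ln(1 + ||u*||_1 / (||u*||_0 * tau)), set to 0 when u* = 0."""
    s = sum(1 for u in u_star if u != 0.0)       # number of nonzero coordinates
    if s == 0:
        return 0.0
    l1 = sum(abs(u) for u in u_star)             # l1 norm
    return s * math.log1p(l1 / (s * tau))

u_star = [0.0, 2.0, 0.0, -0.5, 7.0, 0.0]         # hypothetical sparse vector
tau = 0.1                                        # hypothetical scale
left = sum_of_logs(u_star, tau)
right = sparsity_bound(u_star, tau)
```

The bound depends on the ambient dimension only through the number of nonzero coordinates, so padding the vector with zeros leaves both sides unchanged.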
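B.3 Some maximal inequalities

Next we prove three maximal inequalities needed for the derivation of Corollaries 13 and 17
from Theorems 12 and 16 respectively. Their proofs are quite standard but we provide them
for the convenience of the reader.

Lemma 24 Let $Z_1, \ldots, Z_T$ be $T \ge 1$ (centered) real random variables such that, for a given
constant $v \ge 0$, we have

\[ \forall t \in \{1, \ldots, T\}, \quad \forall \lambda \in \mathbb{R}, \quad \mathbb{E}\bigl[e^{\lambda Z_t}\bigr] \le e^{\lambda^2 v/2}. \tag{47} \]

Then,

\[ \mathbb{E}\Bigl[ \max_{1 \le t \le T} Z_t^2 \Bigr] \le 2v \ln(2eT). \]

Lemma 25 Let $Z_1, \ldots, Z_T$ be $T \ge 1$ real random variables such that, for some given constants
$\alpha > 0$ and $M > 0$, we have

\[ \forall t \in \{1, \ldots, T\}, \quad \mathbb{E}\bigl[e^{\alpha |Z_t|}\bigr] \le M. \]

Then,

\[ \mathbb{E}\Bigl[ \max_{1 \le t \le T} Z_t^2 \Bigr] \le \frac{\ln^2((M + e)T)}{\alpha^2}. \]

Lemma 26 Let $Z_1, \ldots, Z_T$ be $T \ge 1$ real random variables such that, for some given constants
$\alpha > 2$ and $M > 0$, we have

\[ \forall t \in \{1, \ldots, T\}, \quad \mathbb{E}\bigl[|Z_t|^\alpha\bigr] \le M. \]

Then,

\[ \mathbb{E}\Bigl[ \max_{1 \le t \le T} Z_t^2 \Bigr] \le (MT)^{2/\alpha}. \]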



Proof (of Lemma 24) Let $t \in \{1, \ldots, T\}$. From the subgaussian assumption (47) it is
well known (see, e.g., Massart 2007, Chapter 2) that for all $x \ge 0$, we have

\[ \forall t \in \{1, \ldots, T\}, \quad \mathbb{P}(|Z_t| > x) \le 2 e^{-x^2/(2v)}. \]

Let $\delta \in (0, 1)$. By the change of variables $x = \sqrt{2v \ln(2T/\delta)}$, the last inequality entails
that, for all $t = 1, \ldots, T$, we have $|Z_t| \le \sqrt{2v \ln(2T/\delta)}$ with probability at least $1 - \delta/T$.
Therefore, by a union bound, we get, with probability at least $1 - \delta$,

\[ \forall t \in \{1, \ldots, T\}, \quad |Z_t| \le \sqrt{2v \ln(2T/\delta)}. \]

As a consequence, with probability at least $1 - \delta$,

\[ \max_{1 \le t \le T} Z_t^2 \le 2v \ln(2T/\delta) = 2v \ln(1/\delta) + 2v \ln(2T). \]
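It now just remains to integrate the last inequality over $\delta \in (0, 1)$ as is made precise below.
By the change of variables $\delta = e^{-z}$, the latter inequality yields

\[ \forall z > 0, \quad \mathbb{P}\Biggl[ \Biggl( \frac{\max_{1 \le t \le T} Z_t^2 - 2v \ln(2T)}{2v} \Biggr)_+ > z \Biggr] \le e^{-z}, \tag{48} \]

where for all $x \in \mathbb{R}$, $x_+ \triangleq \max\{x, 0\}$ denotes the positive part of $x$. Using the well-known
fact that $\mathbb{E}[\xi] = \int_0^{+\infty} \mathbb{P}(\xi > z) \, dz$ for all nonnegative real random variable $\xi$, we get

\begin{align*}
\mathbb{E}\Biggl[ \frac{\max_{1 \le t \le T} Z_t^2 - 2v \ln(2T)}{2v} \Biggr] &\le \mathbb{E}\Biggl[ \Biggl( \frac{\max_{1 \le t \le T} Z_t^2 - 2v \ln(2T)}{2v} \Biggr)_+ \Biggr] = \int_0^{+\infty} \mathbb{P}\Biggl[ \Biggl( \frac{\max_{1 \le t \le T} Z_t^2 - 2v \ln(2T)}{2v} \Biggr)_+ > z \Biggr] dz \\
&\le \int_0^{+\infty} e^{-z} \, dz = 1,
\end{align*}

where the last line follows from (48) above. Rearranging terms, we get $\mathbb{E}\bigl[\max_{1 \le t \le T} Z_t^2\bigr] \le
2v + 2v \ln(2T)$, which concludes the proof. ■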
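A quick Monte Carlo experiment illustrates the bound of Lemma 24. The sketch below uses independent centered Gaussian variables with variance v, which satisfy the subgaussian assumption (47) with variance factor v; the values of T and v, the number of repetitions, and the seed are arbitrary choices made for illustration.

```python
import math
import random

def mean_max_square(T, v, n_rep=2000, seed=0):
    """Monte Carlo estimate of E[max_t Z_t^2] for T i.i.d. N(0, v) variables."""
    rng = random.Random(seed)
    sd = math.sqrt(v)
    total = 0.0
    for _ in range(n_rep):
        total += max(rng.gauss(0.0, sd) ** 2 for _ in range(T))
    return total / n_rep

T, v = 50, 1.5
estimate = mean_max_square(T, v)
bound = 2.0 * v * math.log(2.0 * math.e * T)     # 2 v ln(2 e T)
```

In this Gaussian setting the estimate falls below the bound with some margin, which is expected since the union bound in the proof does not exploit independence.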



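Proof (of Lemma 25) We first need the following definitions. Let $\psi_\alpha : \mathbb{R}_+ \to \mathbb{R}$ be a
convex majorant of $x \mapsto e^{\alpha \sqrt{x}}$ on $\mathbb{R}_+$ defined by

\[ \psi_\alpha(x) \triangleq \begin{cases} e & \text{if } x < 1/\alpha^2, \\ e^{\alpha \sqrt{x}} & \text{if } x \ge 1/\alpha^2. \end{cases} \]

We associate with $\psi_\alpha$ its generalized inverse $\psi_\alpha^{-1} : \mathbb{R} \to \mathbb{R}_+$ defined by

\[ \psi_\alpha^{-1}(y) \triangleq \begin{cases} 1/\alpha^2 & \text{if } y < e, \\ (\ln y)^2/\alpha^2 & \text{if } y \ge e. \end{cases} \]

Elementary manipulations show that:

• $\psi_\alpha$ is nondecreasing and convex on $\mathbb{R}_+$;

• $\psi_\alpha^{-1}$ is nondecreasing on $\mathbb{R}$;

• $x \le \psi_\alpha^{-1}(\psi_\alpha(x))$ for all $x \in \mathbb{R}_+$.

The proof is based on a Pisier-type argument as is done, e.g., by Massart (2007, Lemma 2.3)
to prove the maximal inequality $\mathbb{E}[\max_{1 \le t \le T} \xi_t] \le \sqrt{2v \ln T}$ for all subgaussian real random
variables $\xi_t$, $1 \le t \le T$, with common variance factor $v \ge 0$.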

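From the inequality $x \le \psi_\alpha^{-1}(\psi_\alpha(x))$ for all $x \in \mathbb{R}_+$ we have

\[ \mathbb{E}\Bigl[ \max_{1 \le t \le T} Z_t^2 \Bigr] \le \psi_\alpha^{-1}\Bigl( \psi_\alpha\Bigl( \mathbb{E}\Bigl[ \max_{1 \le t \le T} Z_t^2 \Bigr] \Bigr) \Bigr) \le \psi_\alpha^{-1}\Bigl( \mathbb{E}\Bigl[ \psi_\alpha\Bigl( \max_{1 \le t \le T} Z_t^2 \Bigr) \Bigr] \Bigr) \le \psi_\alpha^{-1}\Bigl( \mathbb{E}\Bigl[ \max_{1 \le t \le T} \psi_\alpha\bigl(Z_t^2\bigr) \Bigr] \Bigr), \]

where the last two inequalities follow by Jensen's inequality (since $\psi_\alpha$ is convex) and the
fact that both $\psi_\alpha^{-1}$ and $\psi_\alpha$ are nondecreasing.

Since $\psi_\alpha \ge 0$ and $\psi_\alpha^{-1}$ is nondecreasing we get

\begin{align*}
\mathbb{E}\Bigl[ \max_{1 \le t \le T} Z_t^2 \Bigr] &\le \psi_\alpha^{-1}\Biggl( \mathbb{E}\Biggl[ \sum_{t=1}^T \psi_\alpha\bigl(Z_t^2\bigr) \Biggr] \Biggr) \\
&\le \psi_\alpha^{-1}\Biggl( \sum_{t=1}^T \mathbb{E}\bigl[ e^{\alpha |Z_t|} + e \bigr] \Biggr) \\
&\le \psi_\alpha^{-1}(MT + eT) = \frac{\ln^2(MT + eT)}{\alpha^2},
\end{align*}

where the second line follows from the inequality $\psi_\alpha(x) \le e + e^{\alpha \sqrt{x}}$ for all $x \in \mathbb{R}_+$, and where
the last line follows from the bounded exponential moment assumption and the definition
of $\psi_\alpha^{-1}$. It concludes the proof. ■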
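The properties of the convex majorant used in this proof can be sanity-checked numerically. The sketch below assumes the definitions psi(x) = e for x < 1/alpha^2 and exp(alpha sqrt(x)) otherwise, with generalized inverse (ln y)^2/alpha^2 for y >= e; the value 1/alpha^2 returned for y < e is an assumption of this sketch, and that branch is never reached from values of psi. The choice of alpha and of the grid is arbitrary.

```python
import math

def psi(x, alpha):
    """Assumed convex majorant of x -> exp(alpha * sqrt(x)) on R_+."""
    return math.e if x < 1.0 / alpha ** 2 else math.exp(alpha * math.sqrt(x))

def psi_inv(y, alpha):
    """Assumed generalized inverse; the branch y < e is a convention of this sketch."""
    return 1.0 / alpha ** 2 if y < math.e else (math.log(y) / alpha) ** 2

alpha = 1.5
grid = [0.05 * k for k in range(400)]            # x ranges over [0, 20)

majorant = all(psi(x, alpha) >= math.exp(alpha * math.sqrt(x)) for x in grid)
inverse_ok = all(x <= psi_inv(psi(x, alpha), alpha) + 1e-9 for x in grid)
upper_ok = all(psi(x, alpha) <= math.e + math.exp(alpha * math.sqrt(x)) for x in grid)

# Midpoint convexity on consecutive grid triples (small tolerance for rounding).
midpoint_convex = all(
    psi(b, alpha) <= 0.5 * (psi(a, alpha) + psi(c, alpha)) + 1e-9
    for a, b, c in zip(grid, grid[1:], grid[2:])
)
```

The small tolerances only absorb floating-point rounding in the composition of the exponential and the logarithm.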



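Proof (of Lemma 26) As in the previous proof, we have, by Jensen's inequality and the
fact that $x \mapsto x^{\alpha/2}$ is convex and nondecreasing on $\mathbb{R}_+$ (since $\alpha > 2$),

\[ \mathbb{E}\Bigl[ \max_{1 \le t \le T} Z_t^2 \Bigr] \le \mathbb{E}\Bigl[ \Bigl( \max_{1 \le t \le T} Z_t^2 \Bigr)^{\alpha/2} \Bigr]^{2/\alpha} = \mathbb{E}\Bigl[ \max_{1 \le t \le T} |Z_t|^\alpha \Bigr]^{2/\alpha} \le \mathbb{E}\Biggl[ \sum_{t=1}^T |Z_t|^\alpha \Biggr]^{2/\alpha} \le (MT)^{2/\alpha} \]

by the bounded-moment assumption, which concludes the proof.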